[Xapian-discuss] bigrams search speed and index documents

Ying Liu liux0395 at umn.edu
Wed Nov 4 17:03:12 GMT 2009


Hello again,

I am working on a pretty fast computer, Dell Optiplex 960. The memory is:
             total       used       free     shared    buffers     cached
Mem:       3094868    2943068     151800          0     329468    1590012
-/+ buffers/cache:    1023588    2071280
Swap:      9060620      76792    8983828

The chipset (the host bridge line from lspci) is:
00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)

The two files, which together contain more than 12,000 news items, total 
about 17 MB.

My colleague is running the same test with Lemur, and her bigram search 
is about 10 times faster than mine with Xapian, on an identical machine. 
(Index building is very fast with both.) I think there must be something 
I can improve in the way I build the index. How do you usually build the 
index? Is there a more efficient way?

Thank you,
Ying


Ying Liu wrote:
> Hello all,
>
> I am using Xapian to index two XML files. Each file contains more 
> than 6,000 news items, and each item is delimited by <DOC> and 
> </DOC>. The way I build the index is:
>
> 1) read the XML file line by line and extract one news item's headline, 
> date, and contents, which are separated by tags
> 2) remove numbers, convert to lower case, remove stop words, and save 
> the result in $buf
> 3) create a new Xapian::Document $doc, then use the TermGenerator to 
> call set_document($doc) and index_text($buf)
> 4) add $doc to the database $db
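
For readers following along, step 2 above can be sketched as below. This is only an illustration in Python using the standard library, not the code used in the original post: the function name normalise and the STOP_WORDS list are hypothetical, and "remove numbers" is assumed to mean dropping tokens made up entirely of digits.

```python
import re

# Hypothetical stop-word list for illustration only; the post does not say
# which list was actually used.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"})


def normalise(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop all-digit tokens, and drop stop words.

    Returns the surviving tokens joined by single spaces, i.e. the kind of
    buffer that would then be handed to a term generator.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if not t.isdigit() and t not in stop_words]
    return " ".join(kept)


if __name__ == "__main__":
    print(normalise("The 2009 Election Results in Ohio"))
```

One consequence worth keeping in mind: because the filtering happens before the text is indexed, word positions are those of the filtered stream, so two words that were separated by a stop word or a number in the original text end up adjacent in the index.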
>
> For each subsequent news item, I repeat steps 1 to 4. The average 
> length of a news item is about 200 terms. Indexing is very fast, taking 
> about one to two minutes. My question is about search speed. I need 
> to find the bigrams in the indexed documents, i.e., for any two terms, 
> find their common posting list and their position lists within the same 
> document. I found the speed is rather low, about 1562 bigrams/hour.
>
> My question is: is this an efficient way to build the index? If I do 
> steps 1 and 2 above and save the results to a separate file, can I 
> speed up searching? Can I index a file directly instead of using the 
> TermGenerator? A previous post, 
> http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html, 
> mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I do this to speed 
> up searching?
>
> Thank you,
> -Ying
