[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Wed Nov 4 17:03:12 GMT 2009
Hello again,
I am working on a pretty fast computer, Dell Optiplex 960. The memory is:
             total       used       free     shared    buffers     cached
Mem:       3094868    2943068     151800          0     329468    1590012
-/+ buffers/cache:    1023588    2071280
Swap:      9060620      76792    8983828
The cpu is:
00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)
The two files, which together contain more than 12000 pieces of news,
are about 17MB in total.
My colleague is running the same test with Lemur, and her bigram search
is about 10 times faster than mine with Xapian, and our machines are the
same. (Building the index is very fast with both.) I think there must be
something I can improve in the way I build the index. How do you usually
build the index? What is the most efficient way?
Thank you,
Ying
Ying Liu wrote:
> Hello all,
>
> I am using Xapian to index two XML files. In each file, there are
> about 6000+ pieces of news. Each piece of news is separated by <DOC>
> </DOC>. The way I build the index is:
>
> 1) read the XML file line by line, get one piece of news's head, date,
> and contents which are separated by tags
> 2) remove numbers, change to lower case, remove stop words , and the
> information is saved in $buf
> 3) create a new Xapian::Document $doc, and use the TermGenerator to
> set_document($doc) and index_text($buf).
> 4) add the $doc to the database $db
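Step 2 above can be sketched as follows. This is only an illustration in Python using the standard library, not the code from the message (which appears to use the Perl bindings). The stop list and the function name `preprocess` are assumptions made up for the example, and "remove numbers" is read here as dropping tokens made only of digits.

```python
import re

# Hypothetical stop list for illustration only; the message does not say
# which stop words were actually removed.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop all-digit tokens, and remove stop words."""
    # Split the lower-cased text into runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep tokens that are neither pure numbers nor stop words.
    kept = [t for t in tokens if not t.isdigit() and t not in stop_words]
    # Return one string, comparable to the $buf passed to index_text().
    return " ".join(kept)
```

One consequence worth keeping in mind: if the cleaned string is what gets indexed, term positions are presumably assigned after the filtering, so "Bank of England" becomes "bank england" and the two words count as adjacent.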
>
> For the next piece of news, repeat steps 1 to 3 above. The average
> length of each piece of news is about 200 terms. Indexing is very fast,
> taking about one to two minutes. My question is about the searching
> speed. I need to find the bigrams in the indexed documents, i.e., for
> any two terms, find their common posting list and their position lists
> in the same document. I found the speed is rather low, about 1562
> bigrams/hour.
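To make the bigram lookup described above concrete, here is a minimal sketch of the idea on a toy in-memory positional index. It does not use Xapian and says nothing about how Xapian stores or reads posting lists. The names `build_positional_index` and `bigram_count` are hypothetical, and a bigram is assumed to mean the second term directly following the first.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> token list."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        # Positions start at 1 and increase by one per token.
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def bigram_count(index, first, second):
    """Count places where `second` directly follows `first` in one document."""
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    total = 0
    # Only documents present in both posting lists can contain the bigram.
    for doc_id in postings_a.keys() & postings_b.keys():
        following = set(postings_b[doc_id])
        # A match is a position p of `first` with `second` at p + 1.
        total += sum(1 for p in postings_a[doc_id] if p + 1 in following)
    return total
```

For example, with `docs = {1: ["stock", "market", "fell", "stock", "market"], 2: ["market", "stock"]}`, calling `bigram_count(build_positional_index(docs), "stock", "market")` counts the two adjacent occurrences in document 1 and none in document 2. The work per pair is proportional to the shorter of the two posting lists plus the position lists in the shared documents, which is why the order in which pairs are visited and how often position lists are re-read can dominate the running time.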
>
> My question is: is this an efficient way to build the index? If I do
> steps 1 and 2 above and save the results into a separate file, can I
> speed up searching? Can I index a file directly instead of using
> TermGenerator? A previous post,
> http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,
> mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I do this to speed up
> searching?
>
> Thank you,
> -Ying