[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Wed Nov 4 01:38:08 GMT 2009
Hello all,
I am using Xapian to index two XML files. Each file contains more than
6000 news items, and each item is enclosed in <DOC> ... </DOC> tags.
The way I build the index is:
1) read the XML file line by line and extract one news item's headline,
date, and contents, which are delimited by tags
2) remove numbers, convert to lower case, and remove stop words; the
result is saved in $buf
3) create a new Xapian::Document $doc, then call the TermGenerator's
set_document($doc) and index_text($buf)
4) add $doc to the database $db
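
As an illustration of step 2 only, here is a minimal sketch in Python. The
actual script is not shown in this post, so the function name `preprocess`
and the tiny stop-word list are hypothetical placeholders, and the
tokenisation rule (ASCII letters only) is an assumption.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Lower-case the text, remove numbers, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove runs of digits
    tokens = re.findall(r"[a-z]+", text)  # assumption: ASCII words only
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

The returned string plays the role of $buf, i.e. the text that would be
passed to index_text().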
For each subsequent news item, I repeat the steps above. The average
length of a news item is about 200 terms. Indexing is very fast, taking
about one to two minutes. My question is about search speed. I need to
find the bigrams in the indexed documents, i.e., for any two terms, find
the documents their posting lists have in common and the terms' position
lists within each of those documents. I found this to be slow: about
1562 bigrams/hour.
Is this an efficient way to build the index? If I do steps 1 and 2
above and save the results into a separate file, would that speed up
searching? Can I index a file directly instead of using TermGenerator?
A previous post,
http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,
mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I set it, and would it
speed up searching?
Thank you,
-Ying