[Xapian-discuss] bigrams search speed and index documents

Wed Nov 4 01:38:08 GMT 2009

Hello all,

I am using Xapian to index two XML files. Each file contains 6000+ 
news items, and each item is enclosed in <DOC> </DOC> tags. 
I build the index as follows:

1) Read the XML file line by line and extract one news item's head, 
date, and contents, which are delimited by tags.
2) Remove numbers, convert to lower case, remove stop words, and save 
the result in $buf.
3) Create a new Xapian::Document $doc, then use the TermGenerator to 
call set_document($doc) and index_text($buf).
4) Add $doc to the database $db.
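
As a minimal sketch of the clean-up in step 2 (written in Python rather than the language of my script; the function name `normalize` and the tiny stop-word list are illustrative assumptions, not what I actually use), the result would be the string handed to index_text() in step 3:

```python
import re

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def normalize(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop numbers and stop words, return one string."""
    text = text.lower()
    # Keep only runs of ASCII letters, so digits and punctuation are discarded
    tokens = re.findall(r"[a-z]+", text)
    # Drop stop words and join the remaining tokens with single spaces
    return " ".join(t for t in tokens if t not in stop_words)

buf = normalize("The Senate passed 3 bills in 2009")
print(buf)
```

Note that removing stop words before indexing changes the term positions that the index records, which matters if positions are later used to detect adjacent terms.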

Steps 1 to 4 are repeated for each news item. The average length of 
an item is about 200 terms. Indexing is very fast, taking about one 
to two minutes. My question is about search speed. I need to find the 
bigrams in the indexed documents, i.e., for any two terms, find the 
documents that appear in both posting lists and compare the terms' 
position lists within each such document. This is rather slow: I get 
about 1562 bigrams/hour.
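
To make the bigram lookup concrete, here is a rough sketch of the operation on plain Python dictionaries. This is not Xapian's API: the function name `adjacent_docs` and the docid-to-positions mapping are assumptions chosen only to illustrate intersecting two posting lists and then comparing position lists.

```python
def adjacent_docs(postings_a, postings_b):
    """Return {docid: count} where term b occurs directly after term a.

    Each argument maps a docid to the list of positions of one term.
    """
    hits = {}
    # Only documents that contain both terms can contain the bigram
    for docid in postings_a.keys() & postings_b.keys():
        positions_b = set(postings_b[docid])
        # Count positions of a that are immediately followed by b
        count = sum(1 for p in postings_a[docid] if p + 1 in positions_b)
        if count:
            hits[docid] = count
    return hits

# Toy positional postings for two terms
new = {1: [1, 7], 2: [4]}
york = {1: [2, 9], 3: [5]}
result = adjacent_docs(new, york)
print(result)
```

In this toy data only document 1 contains both terms, and only the occurrence at position 1 is followed by the second term at position 2.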

My questions are: Is this an efficient way to build the index? If I 
do steps 1 and 2 ahead of time and save the results in a separate 
file, will searching be faster? Can I index a file directly instead 
of using TermGenerator? A previous post, 
http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html, 
mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I do this to speed up 
searching?

Thank you,
-Ying