[Xapian-discuss] bigrams search speed and index documents

Olly Betts olly at survex.com
Wed Nov 4 22:35:38 GMT 2009


On Tue, Nov 03, 2009 at 07:38:08PM -0600, Ying Liu wrote:
> I am using Xapian to index two XML files. Each file contains more than
> 6000 news items, and each item is delimited by <DOC> </DOC>.
> The way I build the index is:
>
> 1) read the XML file line by line and extract one news item's head, date,
> and contents, which are separated by tags
> 2) remove numbers, convert to lower case, remove stop words, and save the
> result in $buf
> 3) create a new Xapian::Document $doc, then use the TermGenerator to call
> set_document($doc) and index_text($buf)
> 4) add the $doc to the database $db

Please post actual code rather than trying to describe it in English.

> For the next piece of news, repeat the above 1 to 3 steps.

So you only actually add the first document to the database?

If you'd posted the actual code you were using, I wouldn't have to guess...

> The average
> length of each news item is about 200 terms. Indexing is very fast, about
> one to two minutes. My question is about the searching speed. I need to
> find the bigrams of indexed documents, i.e., find any two terms' common
> postinglist and their positionlist in the same document. I found the
> speed is rather low, about 1562 bigrams/hour.

I don't know how you're doing this without seeing the code.

> My question is, is this an efficient way to build the index? If I do
> steps 1 and 2 above and save the results into a separate file, can I
> speed up searching?

I don't see how that would make any difference to search speed - the database
will contain the same terms.

> Can I index a file directly instead of using TermGenerator?

You can just call Document::add_term() and/or Document::add_posting() directly
instead of generating a string to feed to TermGenerator.  I think that would
be an easier and more efficient approach.
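
To illustrate the idea, here is a rough sketch in plain Python rather than
against the real Xapian API.  The PositionalDoc class, index_tokens(),
bigram_count() and the tiny STOP_WORDS set are all hypothetical names made up
for this example; add_posting() here only mimics the (term, position) shape of
Document::add_posting().  It assumes positions are counted after numbers and
stop words have been dropped, as in the steps described above.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in", "and"}  # illustrative only


class PositionalDoc:
    """Hypothetical stand-in for Xapian::Document; not the Xapian API."""

    def __init__(self):
        # term -> list of positions, in increasing order
        self.positions = defaultdict(list)

    def add_posting(self, term, pos):
        # Mimics the shape of Document::add_posting(term, position).
        self.positions[term].append(pos)


def index_tokens(tokens):
    """Add terms directly, skipping numbers and stop words.

    No intermediate string is built for a term generator to re-parse.
    """
    doc = PositionalDoc()
    pos = 0
    for tok in tokens:
        term = tok.lower()
        if term.isdigit() or term in STOP_WORDS:
            continue
        pos += 1  # positions count only the terms that are kept
        doc.add_posting(term, pos)
    return doc


def bigram_count(doc, first, second):
    """Count the places where `second` directly follows `first`."""
    following = set(doc.positions.get(second, ()))
    return sum(1 for p in doc.positions.get(first, ()) if p + 1 in following)
```

With real Xapian the document would then be added to the database as before;
the point is only that the terms and positions come straight from your own
tokenising loop.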

> A previous post,
> http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,
> mentioned tuning XAPIAN_FLUSH_THRESHOLD. How can I do this to speed up
> searching?

XAPIAN_FLUSH_THRESHOLD only affects indexing.  It can slightly change where
posting list chunk boundaries fall, and the internal layout of blocks in the
Btree, which may indirectly affect search speed, but it has no direct effect
on searching.

Cheers,
    Olly


