[Xapian-discuss] Tuning Phrase Searching

Arjen van der Meijden acmmailing at tweakers.net
Sat Nov 5 21:39:43 GMT 2005


On 3-11-2005 13:32, tech at dbx.co.uk wrote:
> Is there anything that can be done to speed up phrase searching? It is
> currently a show stopper for our CV search system with queries for common
> terms taking several minutes to execute. Simply ANDing the terms together
> will return in 1-3 seconds.

If you know beforehand what your phrases will look like and how you'll 
search for them, you may be able to. For example, if you have system 
paths and look through them in "tree order", you can just build up the 
subpaths and index them as normal terms (/usr/local/bin/omega can be 
/usr, /usr/local, /usr/local/bin).
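
To make the subpath idea concrete, here is a minimal Python sketch. The 
function name subpath_terms is made up for illustration, and how the 
resulting strings are added to the index (term prefix, escaping) is left 
open:

```python
def subpath_terms(path):
    """Return every leading directory subpath of an absolute path, shortest first."""
    parts = [p for p in path.split("/") if p]
    # The final component is left out: only the directories above it become terms,
    # matching the /usr/local/bin/omega example.
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


# Each returned string would be indexed as an ordinary term for the document.
terms = subpath_terms("/usr/local/bin/omega")
```

A search for everything below /usr/local is then a single-term lookup 
and needs no position data at all.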
But if it's just plain text and you want normal sentences to be 
retrievable... you're probably just stuck with finding each document 
containing the terms and checking whether those terms are in the correct 
order. There are search engines which only use word pairs and therefore 
cannot correctly identify hits (they also see a document containing 
"foo bar" and "bar test" in separate places as a match for 
"foo bar test").
It may be faster to combine such word pairs with normal phrase 
searching: build a query that checks for both the correct word pairs 
and the phrase.
The drawback is of course that you'll increase the size of your postlist 
quite a bit (you don't need the pairs in the position table, however). 
But the advantage should be that you can narrow down the list of 
documents a lot better than with the normal "AND search" which is the 
basis for the phrase search.
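
As a rough illustration of that trade-off, here is a toy in-memory index 
in Python. It is not Xapian code and not how Xapian stores things; the 
names and the "a b" pair-term format are assumptions for the example. 
Pair terms go into the postings only, and positions are kept for single 
words only:

```python
from collections import defaultdict


def build_index(docs):
    postings = defaultdict(set)    # term -> doc ids; holds words AND word pairs
    positions = defaultdict(list)  # (doc id, word) -> positions; single words only
    for doc_id, text in docs.items():
        words = text.split()
        for pos, word in enumerate(words):
            postings[word].add(doc_id)
            positions[(doc_id, word)].append(pos)
        for a, b in zip(words, words[1:]):
            postings[a + " " + b].add(doc_id)  # pair term, no position data
    return postings, positions


def phrase_search(postings, positions, phrase):
    words = phrase.split()
    # Plain AND of the single words: the usual candidate set for a phrase check.
    and_candidates = set.intersection(*(postings.get(w, set()) for w in words))
    # Additionally require every adjacent pair term: a smaller candidate set.
    pair_terms = [a + " " + b for a, b in zip(words, words[1:])]
    pair_candidates = set.intersection(
        and_candidates, *(postings.get(t, set()) for t in pair_terms))
    # The expensive positional check now runs on the reduced set only.
    hits = set()
    for doc_id in pair_candidates:
        starts = positions.get((doc_id, words[0]), [])
        if any(all(s + k in positions.get((doc_id, w), [])
                   for k, w in enumerate(words)) for s in starts):
            hits.add(doc_id)
    return and_candidates, pair_candidates, hits


docs = {1: "foo bar test", 2: "test bar foo", 3: "foo bar baz bar test"}
postings, positions = build_index(docs)
and_set, pair_set, hits = phrase_search(postings, positions, "foo bar test")
```

Document 2 is dropped by the pair terms before any positions are read; 
document 3 survives the pair filter and is only rejected by the 
positional check, which is why the phrase test is still needed.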

> I keep thinking that I must be missing something in either the way I index
> or the way I (or rather the QueryParser) constructs the queries.

In the general case, I don't think there really is a better way. But if 
space is no problem and the time spent in the position table is what 
matters most, you may be able to increase the size of the indexes in 
order to decrease the number of documents to look through.
Olly already mentioned using Flint; using xapian-compact to further 
decrease the size of the database may help a lot for searches. You may 
want to keep two versions of your database: the non-compacted one for 
updating and the fully compacted one for searches.
For Flint the compaction is a bit less dramatic than for Quartz: with 
Flint our 14G non-compacted database shrinks to 12G compacted (which 
uses zlib-compression as well). The drawback of compaction is of course 
the time it takes: one hour on our machine.
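
The two-database setup could be scripted roughly like this. It is only a 
sketch: the database paths are hypothetical, it assumes xapian-compact 
is on the PATH and is called as "xapian-compact SOURCE DESTINATION", and 
switching searchers over to the new copy is left out:

```python
import subprocess


def compact_command(live_db, search_db):
    """Build the xapian-compact call: source database first, destination last."""
    return ["xapian-compact", live_db, search_db]


def rebuild_search_copy(live_db, search_db, run=subprocess.run):
    """Compact the database used for updates into a separate copy used for searches.

    `run` is injectable so the function can be exercised without Xapian installed.
    """
    run(compact_command(live_db, search_db), check=True)


# Hypothetical paths; on a large database this step can take a long time.
# rebuild_search_copy("/var/lib/xapian/cv.live", "/var/lib/xapian/cv.search.new")
```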

Best regards,

Arjen


