[Xapian-discuss] Filtering queries with many boolean terms

Jason Tackaberry tack at urandom.ca
Sun Oct 4 02:15:48 BST 2009


Hi all,

I'm thinking of using xapian to index my growing mail collection, which
I suppose is a reasonably common use-case.  (The fact that gmane has
used xapian to index some tens of millions of emails is what piqued my
interest.)

First I want to start off with how very impressed I've been with Xapian.
The API is nice, feature-complete bindings are available for my
preferred language, it's very fast, and it generally Just Works exactly
how I expect.  It's very good quality software.

One of the features I'd like to have is gmail-like tagging of threads.
This means I need to be able to add and remove tags to arbitrarily many
documents after they have been indexed.

Unfortunately, xapian is quite slow at updating many documents.
Updating 10k documents by adding an XTAGfoo term takes 40 seconds
(including db flush) on my system.  One use-case for updating this many
documents is, for example, applying a tag to existing emails in a
mailing list.

So, I am looking into an alternative approach.  I'm already maintaining
a table of all threads (gmail would call them conversations) in an sql
database.  All emails in the xapian database are given an XTID<n> term
corresponding to the thread id in the sql database.  If I want to search
for all documents with the word "foo" and with tag "bar", I figured I'd
first look up all thread ids with tag "bar" in SQL, and then use those
ids as filters with the Xapian search.  So the Xapian query might look
like:

        (foo) (tid:0 tid:1 tid:2 tid:3)

Where tid: is prefix for boolean term XTID.  This translates to:

        Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
        XTID3)))


This works fine until I start to add many XTID terms, and then I start
to notice that query time increases (about quadratically it seems) with
the number of terms.  This surprised me, because a search for "foo" by
itself finishes in 0.0005 seconds with all matches fetched.  I expected
the XTID terms to apply as a filter, and therefore be no worse than the
time it takes for the query to complete (with all matches fetched)
without the XTID terms.

Indeed, if the result set for query "foo" is sufficiently low (maybe a
few thousand documents), it's actually faster for me to iterate over
them all and manually check the XTID values myself.

Is there any way to improve Xapian with this type of search?

I've also observed that if I put "OR" between the tid: terms, it
actually generates a different query that executes in less than half the
time than without "OR":

        (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)
        
becomes:
        
        Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
        XTID2 OR 0 * XTID3)))

This was another surprising result, because I understood that "OR" was
implicit when multiple boolean terms were specified, so I didn't expect
a different query to be produced.

Lastly, if I remove the parens, a different query still is generated:

        (foo) tid:0 tid:1 tid:2 tid:3
        
becomes:
        
        Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
        TID3)))

Unfortunately this has about the same performance as the first form.

Even with the fastest form (tid:0 OR tid:1 OR tid:2 [...]) it's still a
bit disappointing.  Hopefully someone can shed some insight as to what's
happening, and how I might improve the performance.

Thanks!
Jason.




More information about the Xapian-discuss mailing list