[Xapian-discuss] Filtering queries with many boolean terms

Tue Oct 6 11:00:15 BST 2009

On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:
> Unfortunately, xapian is quite slow at updating many documents.
> Updating 10k documents by adding an XTAGfoo term takes 40 seconds
> (including db flush) on my system.  One use-case for updating this many
> documents is, for example, applying a tag to existing emails in a
> mailing list.

Currently we rewrite all the terms if any are changed, which is why it
is a bit slow.  This could be improved:

http://trac.xapian.org/ticket/250

So one approach would just be to live with the 40 seconds for now and
you'll get a nice speed up when that optimisation is implemented.

> So, I am looking into an alternative approach.  I'm already maintaining
> a table of all threads (gmail would call them conversations) in an sql
> database.  All emails in the xapian database are given an XTID<n> term
> corresponding to the thread id in the sql database.  If I want to search
> for all documents with the word "foo" and with tag "bar", I figured I'd
> first look up all thread ids with tag "bar" in SQL, and then use those
> ids as filters with the Xapian search.  So the Xapian query might look
> like:
> 
>         (foo) (tid:0 tid:1 tid:2 tid:3)

Note that you really shouldn't build up query strings mechanically to
pass to the QueryParser.  It doesn't have a rigidly defined syntax, but
is really a heuristic parser aiming to handle potentially poorly formed
human written input (and those heuristics can change between releases)
so it may not parse your built query as you intended, or may stop
doing so.

Instead parse "foo" and then append the filters using the API:

    // Assuming the tids are listed in vector<string> tids:
    Xapian::Query filter(Xapian::Query::OP_OR, tids.begin(), tids.end());
    Xapian::Query q = qp->parse_query("foo");
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, filter);

> Where tid: is prefix for boolean term XTID.  This translates to:
> 
>         Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
>         XTID3)))
> 
> This works fine until I start to add many XTID terms, and then I start
> to notice that query time increases (about quadratically it seems) with
> the number of terms.  This surprised me, because a search for "foo" by
> itself finishes in 0.0005 seconds with all matches fetched.  I expected
> the XTID terms to apply as a filter, and therefore be no worse than the
> time it takes for the query to complete (with all matches fetched)
> without the XTID terms.

Well the filter part will take time to evaluate, but will potentially
reduce the work required for the probabilistic part.  So it's not
totally clear whether it will be fast with or without the filter,
but I/O is usually the limiting factor (having to read a single block
from disk takes many many CPU cycles) so I'd probably guess the filtered
version would be slower if there are a lot of "XTID" terms in it.

> Indeed, if the result set for query "foo" is sufficiently low (maybe a
> few thousand documents), it's actually faster for me to iterate over
> them all and manually check the XTID values myself.

But this looks odd to me.  When you say "query time increases", what
are you actually timing here?  If it includes the query parsing time
then I suspect the quadratic behaviour is probably there.

Pair-wise construction of a long query is still quadratic:

http://trac.xapian.org/ticket/273

We've worked around this in places in the QueryParser by putting
subqueries in a container and building a query from that, but it's
quite possible this can still happen in some cases.  Generally
humans don't enter queries with more than a dozen terms, so quadratic
behaviour usually doesn't kick in and so can easy escape notice.

I'll try to produce a simple testcase later today.

> I've also observed that if I put "OR" between the tid: terms, it
> actually generates a different query that executes in less than half the
> time than without "OR":
> 
>         (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)
>         
> becomes:
>         
>         Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
>         XTID2 OR 0 * XTID3)))

This is just a different representation with the same meaning as the
query above and should result in exactly the same internal
representation in the matcher and so have exactly the same performance.
If it's much faster, that suggests that this is a parsing issue.

> Lastly, if I remove the parens, a different query still is generated:
> 
>         (foo) tid:0 tid:1 tid:2 tid:3
>         
> becomes:
>         
>         Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
>         TID3)))

This one should also result in the same internal representation in the
matcher.

Cheers,
    Olly