[Xapian-devel] Something to think about

Wed Oct 17 19:37:27 BST 2007

On 10/9/07, Richard Boulton <richard at lemurconsulting.com> wrote:
> I'm planning to add multiple-database support for searches to my "Xappy"
> python wrapper (more on this wrapper later, but for now, see
> http://code.google.com/p/xappy for details).  This is reasonably
> straightforward, because Xapian supports this nicely: except that
> "Xappy" generates a "fieldname->prefix" mapping automatically.  The
> prefix which corresponds to a particular field is therefore hidden from
> the user, and crucially, it may be different in different databases.
>
> My current plan is to add a "databaseID" term to each document, and
> construct a composite query.  For example, the search "author:foo"
> across databases with ids "db1" and "db2", where the prefix for author
> in db1 is "A" and the prefix for author in db2 is "B", would become:
>
>  (Afoo FILTER db1) OR (Bfoo FILTER db2)
>
> This should give the right sort of results, but the statistics for the
> terms will be a bit broken.  (Actually, I'm not totally convinced
> they'll be broken in a harmful way, because if the term is more frequent
> in one collection than another, this could correspond to it being more
> significant when it occurs in the collection in which it is less
> frequent.)  At some point it would be nice to add the ability to have a
> mapping from "human-readable field name" to "prefix code" inside xapian,
> so the multidatabase stuff could be aware of this issue and generate the
> prefixes correctly for each database.  However, that's not urgent, and
> not what I'm thinking about right now.
>
> It would also be nice to have a "virtual" posting list, which
> effectively returned a list of all the document IDs in a particular
> database, so I didn't have to explicitly store the "databaseID" terms.
> But that's also not what I'm thinking about right now.
>
> I was thinking about how such a query could be processed optimally.  If
> the sub-databases were remote databases, the search would be executed
> against each database separately, and in each database all but one of
> the bracketed clauses would quickly degrade to an emptypostlist (because
> the filter term wouldn't be present in the database).  The results of
> these searches would then be combined, and this would be pretty much
> optimal in terms of processing for the combined query.
>
> However, if the sub-databases are local, the search will be performed
> across all the sub-databases in parallel, and because the document IDs
> for each database are interleaved, none of the bracketed clauses will be
> able to degrade, or even do much "skipping" of document IDs based on the
> filter term.
>
> One way to fix this would be to add a flag (or similar mechanism)
> telling a multiple database to generate composite IDs by sequentially
> combining the databases; so DB1 might have IDs from 1 to 13498 and DB2
> might have IDs from 13499 onwards.  This would result in the bracketed
> clauses skipping all document IDs which don't correspond to the
> databases defined by the filter term, and should therefore be a lot more
> efficient.
>
> It could also improve the disk-access pattern when doing a simpler
> multiple database search, because each database will be processed in
> turn rather than skipping from one database to another.
>
> Of course, this scheme relies on the document IDs used by each database
> being relatively compact, and would result in the document IDs in a
> multidatabase changing each time the highest document ID in the first
> database changed, so isn't a perfect scheme by any means.
>
> Also, there are probably ways in which using this kind of document ID
> merging scheme could harm performance, but we could only really tell by
> doing some tests to investigate the performance difference.

I will definitely measure the performance differences on my 52 million
document engine and will give the true performance numbers because
indexing and searching performance is my top priority.

Kevin Duraj

>
> Another approach is to allow the remote-database style of multi-database
> search to be used for local multi-database searches - ie, compute the
> interesting part of the mset for each database separately, and then
> merge them together.  This can result in a lot more documents being
> considered than necessary, though (particularly if the part of the mset
> requested is large, or starts at a high index).
>
> Thoughts welcomed - I'm not planning to work on this in the imminent
> future, but I thought starting a discussion of it, or just mentioning it
> to let the ideas ferment a while in our minds, would be worthwhile.
>
> --
> Richard
>
> _______________________________________________
> Xapian-devel mailing list
> Xapian-devel at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-devel
>