[Xapian-devel] Something to think about

Wed Oct 10 02:09:51 BST 2007

I'm planning to add multiple-database support for searches to my "Xappy" 
python wrapper (more on this wrapper later, but for now, see 
http://code.google.com/p/xappy for details).  This is reasonably 
straightforward, because Xapian supports this nicely: except that 
"Xappy" generates a "fieldname->prefix" mapping automatically.  The 
prefix which corresponds to a particular field is therefore hidden from 
the user, and crucially, it may be different in different databases.

My current plan is to add a "databaseID" term to each document, and 
construct a composite query.  For example, the search "author:foo" 
across databases with ids "db1" and "db2", where the prefix for author 
in db1 is "A" and the prefix for author in db2 is "B", would become:

  (Afoo FILTER db1) OR (Bfoo FILTER db2)

This should give the right sort of results, but the statistics for the 
terms will be a bit broken.  (Actually, I'm not totally convinced 
they'll be broken in a harmful way, because if the term is more frequent 
in one collection than another, this could correspond to it being more 
significant when it occurs in the collection in which it is less 
frequent.)  At some point it would be nice to add the ability to have a 
mapping from "human-readable field name" to "prefix code" inside xapian, 
so the multidatabase stuff could be aware of this issue and generate the 
prefixes correctly for each database.  However, that's not urgent, and 
not what I'm thinking about right now.

It would also be nice to have a "virtual" posting list, which 
effectively returned a list of all the document IDs in a particular 
database, so I didn't have to explicitly store the "databaseID" terms. 
But that's also not what I'm thinking about right now.

I was thinking about how such a query could be processed optimally.  If 
the sub-databases were remote databases, the search would be executed 
against each database separately, and in each database all but one of 
the bracketed clauses would quickly degrade to an emptypostlist (because 
the filter term wouldn't be present in the database).  The results of 
these searches would then be combined, and this would be pretty much 
optimal in terms of processing for the combined query.

However, if the sub-databases are local, the search will be performed 
across all the sub-databases in parallel, and because the document IDs 
for each database are interleaved, none of the bracketed clauses will be 
able to degrade, or even do much "skipping" of document IDs based on the 
filter term.

One way to fix this would be to add a flag (or similar mechanism) 
telling a multiple database to generate composite IDs by sequentially 
combining the databases; so DB1 might have IDs from 1 to 13498 and DB2 
might have IDs from 13499 onwards.  This would result in the bracketed 
clauses skipping all document IDs which don't correspond to the 
databases defined by the filter term, and should therefore be a lot more 
efficient.

It could also improve the disk-access pattern when doing a simpler 
multiple database search, because each database will be processed in 
turn rather than skipping from one database to another.

Of course, this scheme relies on the document IDs used by each database 
being relatively compact, and would result in the document IDs in a 
multidatabase changing each time the highest document ID in the first 
database changed, so isn't a perfect scheme by any means.

Also, there are probably ways in which using this kind of document ID 
merging scheme could harm performance, but we could only really tell by 
doing some tests to investigate the performance difference.

Another approach is to allow the remote-database style of multi-database 
search to be used for local multi-database searches - ie, compute the 
interesting part of the mset for each database separately, and then 
merge them together.  This can result in a lot more documents being 
considered than necessary, though (particularly if the part of the mset 
requested is large, or starts at a high index).

Thoughts welcomed - I'm not planning to work on this in the imminent 
future, but I thought starting a discussion of it, or just mentioning it 
to let the ideas ferment a while in our minds, would be worthwhile.

-- 
Richard