[Xapian-devel] Something to think about

Fri Oct 12 14:22:32 BST 2007

On Wed, Oct 10, 2007 at 02:09:51AM +0100, Richard Boulton wrote:
> I'm planning to add multiple-database support for searches to my "Xappy" 
> python wrapper (more on this wrapper later, but for now, see 
> http://code.google.com/p/xappy for details).  This is reasonably 
> straightforward, because Xapian supports this nicely: except that 
> "Xappy" generates a "fieldname->prefix" mapping automatically.  The 
> prefix which corresponds to a particular field is therefore hidden from 
> the user, and crucially, it may be different in different databases.

I think the simplest solution here would be to just use the user's
fieldname as the prefix.  So the "shoe_size" field could be mapped to
"XSHOE_SIZE".  You could add special handling for standard prefixes
if you wish.
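
A minimal sketch of that mapping, in plain Python with no Xapian calls.
The function name and the STANDARD_PREFIXES table are hypothetical; the
entries in the table follow the Omega-style conventions as I understand
them, and are only there to illustrate the "special handling" idea:

```python
# Hypothetical table of "standard" prefixes to special-case.  The
# entries are an assumption for illustration, not part of Xappy.
STANDARD_PREFIXES = {"title": "S", "author": "A", "keyword": "K"}


def field_to_prefix(fieldname):
    """Derive a term prefix from the user's field name alone.

    The result depends only on the field name, so every database
    gets the same prefix for the same field.
    """
    key = fieldname.lower()
    if key in STANDARD_PREFIXES:
        return STANDARD_PREFIXES[key]
    # "X" introduces a user-defined prefix; the rest is the field name.
    return "X" + fieldname.upper()
```

Because no per-database state is involved, two databases built
separately would agree on the prefix without any stored mapping.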

If you want case sensitivity of field names, you could either just
eschew the usual Xapian convention of upper-case prefixes, or provide
some sort of encoding for the case.
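
One possible encoding, purely as a sketch: upper-case the whole name,
but mark letters which were originally upper case with an escape
character.  The choice of "=" as the escape, the function names, and
the restriction to ASCII field names are all assumptions made for this
example:

```python
ESCAPE = "="  # hypothetical escape character; assumes ASCII field names


def encode_field(fieldname):
    """Map a case-sensitive field name to an upper-case prefix."""
    out = []
    for ch in fieldname:
        if ch == ESCAPE:
            out.append(ESCAPE + ESCAPE)   # literal escape character
        elif ch.isupper():
            out.append(ESCAPE + ch)       # originally upper case
        else:
            out.append(ch.upper())        # originally lower case (or no case)
    return "X" + "".join(out)


def decode_prefix(prefix):
    """Invert encode_field(), recovering the original field name."""
    body = prefix[1:]  # drop the leading "X"
    out = []
    i = 0
    while i < len(body):
        if body[i] == ESCAPE:
            out.append(body[i + 1])       # escaped character, kept as-is
            i += 2
        else:
            out.append(body[i].lower())
            i += 1
    return "".join(out)
```

With this scheme "shoesize" and "shoeSize" become "XSHOESIZE" and
"XSHOE=SIZE" respectively, so they no longer collide.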

> One way to fix this would be to add a flag (or similar mechanism) 
> telling a multiple database to generate composite IDs by sequentially 
> combining the databases; so DB1 might have IDs from 1 to 13498 and DB2 
> might have IDs from 13499 onwards.  [...]
> Of course, this scheme relies on the document IDs used by each database 
> being relatively compact, and would result in the document IDs in a 
> multidatabase changing each time the highest document ID in the first 
> database changed, so isn't a perfect scheme by any means.

I think it would be useful to support this in some way anyway.
Interleaving isn't a perfect solution either.  Really its main benefit
is simply that it does provide stable merged document ids even if the
constituent databases are updated.
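
To make the trade-off concrete, here is a small sketch of the two
numbering schemes.  The interleaved formula is how I understand the
current multi-database mapping to work; the function names and the
last_docids list (the highest document ID in each database) are
assumptions for the example:

```python
def interleaved_id(db_index, docid, n_dbs):
    """Merged ID when sub-database IDs are interleaved.

    db_index is 0-based; docid is the 1-based ID inside that database.
    """
    return (docid - 1) * n_dbs + db_index + 1


def sequential_id(db_index, docid, last_docids):
    """Merged ID when databases are laid end to end.

    last_docids[i] is the highest document ID in database i, so the
    offset for a database is the sum over the ones before it.
    """
    return sum(last_docids[:db_index]) + docid


# Document 1 of the second database, before and after the first
# database grows from 13498 to 13500 documents.
before = sequential_id(1, 1, [13498, 500])
after = sequential_id(1, 1, [13500, 500])
stable = interleaved_id(1, 1, 2)
```

Here the sequential ID moves from 13499 to 13501 when the first
database grows, while the interleaved ID stays at 2 however large
either database becomes.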

> Another approach is to allow the remote-database style of multi-database 
> search to be used for local multi-database searches - ie, compute the 
> interesting part of the mset for each database separately, and then 
> merge them together.  This can result in a lot more documents being 
> considered than necessary, though (particularly if the part of the mset 
> requested is large, or starts at a high index).

If the local results are computed sequentially, you could use the
minimum MSet weight from the first match as the initial min-weight for
the second match.  If you merge each new MSet in as you go, this would
allow each match to do progressively less work.
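
A toy sketch of that idea, using lists of (weight, docid) pairs in
place of real databases.  Nothing here is Xapian API: match() and
search_sequentially() are hypothetical stand-ins, and the "considered"
count is just a crude measure of how much work each match did.  Note
that the minimum weight can only be raised once the merged set is
full:

```python
import heapq


def match(db, maxitems, min_weight=0.0):
    """Toy matcher over a list of (weight, docid) pairs.

    Returns the top maxitems hits with weight >= min_weight, plus the
    number of candidates which survived the min-weight test.
    """
    candidates = [(w, d) for (w, d) in db if w >= min_weight]
    return heapq.nlargest(maxitems, candidates), len(candidates)


def search_sequentially(dbs, maxitems):
    """Match each database in turn, merging results as we go."""
    merged = []        # best hits so far, as (weight, (db_index, docid))
    min_weight = 0.0
    work = []          # candidates considered per database
    for idx, db in enumerate(dbs):
        hits, considered = match(db, maxitems, min_weight)
        work.append(considered)
        tagged = [(w, (idx, d)) for (w, d) in hits]
        merged = heapq.nlargest(maxitems, merged + tagged)
        if len(merged) == maxitems:
            # Anything lighter than the current worst hit cannot make
            # it into the final result, so later matches may skip it.
            min_weight = merged[-1][0]
    return merged, work


dbs = [
    [(0.9, 1), (0.5, 2), (0.8, 3)],
    [(0.95, 1), (0.1, 2), (0.2, 3)],
]
results, work = search_sequentially(dbs, 2)
```

In this example the first match considers all three of its documents,
but the second only needs to consider the one document whose weight
reaches 0.8.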

Cheers,
    Olly