[Xapian-devel] Something to think about
Richard Boulton
richard at lemurconsulting.com
Wed Oct 10 02:09:51 BST 2007
I'm planning to add multiple-database support for searches to my "Xappy"
python wrapper (more on this wrapper later, but for now, see
http://code.google.com/p/xappy for details). This is reasonably
straightforward, because Xapian supports this nicely: except that
"Xappy" generates a "fieldname->prefix" mapping automatically. The
prefix which corresponds to a particular field is therefore hidden from
the user, and crucially, it may be different in different databases.
My current plan is to add a "databaseID" term to each document, and
construct a composite query. For example, the search "author:foo"
across databases with ids "db1" and "db2", where the prefix for author
in db1 is "A" and the prefix for author in db2 is "B", would become:
(Afoo FILTER db1) OR (Bfoo FILTER db2)
This should give the right sort of results, but the statistics for the
terms will be a bit broken. (Actually, I'm not totally convinced
they'll be broken in a harmful way, because if the term is more frequent
in one collection than another, this could correspond to it being more
significant when it occurs in the collection in which it is less
frequent.) At some point it would be nice to add the ability to have a
mapping from "human-readable field name" to "prefix code" inside xapian,
so the multidatabase stuff could be aware of this issue and generate the
prefixes correctly for each database. However, that's not urgent, and
not what I'm thinking about right now.
It would also be nice to have a "virtual" posting list, which
effectively returned a list of all the document IDs in a particular
database, so I didn't have to explicitly store the "databaseID" terms.
But that's also not what I'm thinking about right now.
I was thinking about how such a query could be processed optimally. If
the sub-databases were remote databases, the search would be executed
against each database separately, and in each database all but one of
the bracketed clauses would quickly degrade to an emptypostlist (because
the filter term wouldn't be present in the database). The results of
these searches would then be combined, and this would be pretty much
optimal in terms of processing for the combined query.
However, if the sub-databases are local, the search will be performed
across all the sub-databases in parallel, and because the document IDs
for each database are interleaved, none of the bracketed clauses will be
able to degrade, or even do much "skipping" of document IDs based on the
filter term.
One way to fix this would be to add a flag (or similar mechanism)
telling a multiple database to generate composite IDs by sequentially
combining the databases; so DB1 might have IDs from 1 to 13498 and DB2
might have IDs from 13499 onwards. This would result in the bracketed
clauses skipping all document IDs which don't correspond to the
databases defined by the filter term, and should therefore be a lot more
efficient.
It could also improve the disk-access pattern when doing a simpler
multiple database search, because each database will be processed in
turn rather than skipping from one database to another.
Of course, this scheme relies on the document IDs used by each database
being relatively compact, and would result in the document IDs in a
multidatabase changing each time the highest document ID in the first
database changed, so isn't a perfect scheme by any means.
Also, there are probably ways in which using this kind of document ID
merging scheme could harm performance, but we could only really tell by
doing some tests to investigate the performance difference.
Another approach is to allow the remote-database style of multi-database
search to be used for local multi-database searches - ie, compute the
interesting part of the mset for each database separately, and then
merge them together. This can result in a lot more documents being
considered than necessary, though (particularly if the part of the mset
requested is large, or starts at a high index).
Thoughts welcomed - I'm not planning to work on this in the imminent
future, but I thought starting a discussion of it, or just mentioning it
to let the ideas ferment a while in our minds, would be worthwhile.
--
Richard
More information about the Xapian-devel
mailing list