[Xapian-discuss] merge database and maintain order

Olly Betts olly at survex.com
Sun Mar 25 00:02:50 GMT 2007


On Sat, Mar 24, 2007 at 10:05:34PM +0000, Mark Clarkson wrote:
> I've found through experimentation that merging databases maintains the
> strict date order. In my case I have a 25 GB database, db1, and a daily
> tiny database, db2. If I merge with 'xapian-compact db1 db2 dbnew' then
> the date order is preserved in dbnew.

Yes.  Both quartzcompact and xapian-compact copy all the documents from
the first database, followed by all those in the second, and so on.
This is actually required for efficiency - it means we don't need to
reencode most of the posting lists, since they encode the difference
between one document id and the next.
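
To illustrate why appending helps, here is a toy model of a
delta-encoded posting list.  It is only a sketch: the function names are
made up for this example and it is not the real on-disk format.  It
shows that when the second database's document ids are all shifted by a
fixed offset and appended, only the first gap of its list needs
recomputing.

```python
def delta_encode(docids):
    """Store each docid as the gap from the previous one (first gap is from 0)."""
    deltas, prev = [], 0
    for did in docids:
        deltas.append(did - prev)
        prev = did
    return deltas


def delta_decode(deltas):
    """Rebuild absolute docids by summing the gaps."""
    docids, cur = [], 0
    for gap in deltas:
        cur += gap
        docids.append(cur)
    return docids


def append_shifted(deltas_a, last_a, deltas_b, offset):
    """Append list b after list a, with every docid in b shifted by offset.

    last_a is the last docid present in list a.  Only b's first gap has to
    be recomputed; the remaining gaps are copied unchanged.
    """
    if not deltas_b:
        return list(deltas_a)
    first_b = deltas_b[0] + offset      # b's first docid after shifting
    return list(deltas_a) + [first_b - last_a] + list(deltas_b[1:])


# Hypothetical postings for one term: db1 has 9 documents in total.
a = delta_encode([2, 5])
b = delta_encode([1, 4, 6])
merged = append_shifted(a, last_a=5, deltas_b=b, offset=9)
```

Decoding `merged` gives db1's ids followed by db2's ids shifted by 9,
and the tail of `merged` is byte-for-byte the tail of `b`.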

> However I don't know how to maintain this date order when searching db1
> and db2 together before merging, i.e. by adding both db1 and db2 to a
> Database object and passing this to enquire. Is this possible?

Not currently.

When you search two databases, we currently interleave the document ids.
The original reason we chose this approach was that it means that
existing document ids for a set of databases being searched together
remain stable even when new documents are added to some of the
databases.  It also keeps the document ids in the combined database
small.
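
A minimal sketch of the interleaving arithmetic, assuming the usual
round-robin mapping across n databases (the function names here are
invented for illustration and are not part of the API):

```python
def combined_docid(sub_docid, db_index, n_dbs):
    """Map a docid in sub-database db_index (0-based) to the combined docid."""
    return (sub_docid - 1) * n_dbs + db_index + 1


def split_docid(docid, n_dbs):
    """Inverse mapping: return (sub_docid, db_index) for a combined docid."""
    return (docid - 1) // n_dbs + 1, (docid - 1) % n_dbs


# With two databases, db index 0 gets the odd ids and index 1 the even ids.
ids_db1 = [combined_docid(d, 0, 2) for d in (1, 2, 3)]
ids_db2 = [combined_docid(d, 1, 2) for d in (1, 2, 3)]
```

Adding a fourth document to either database only creates a new combined
id; none of the existing ones move, which is the stability property
described above.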

As a historical note, the system Xapian was originally written to
replace used an arbitrary offset for the first docid in each database --
e.g. multiples of 1000000 -- which gives large document ids even if the
databases are all small, and fails badly if any database has more
documents than the offset.  But if the offset is set too large, it limits
how many databases can be combined.

But there's a common situation where you have a series of databases and
only update the last.  Similarly, it's common to have a "main" database,
plus updates in a second database.  Periodically the two are merged to
give a new "main", and a new second database is started from scratch.
In both cases it would be nice to have the merged document ids generated
in a similar way to those you'd get from xapian-compact.

It wouldn't be too hard to implement, I think.  The trickiest part might
be gracefully handling the case when a database other than the last has
documents added.  Or perhaps calling "reopen()" on the combined database
would just change the offsets, so we'd just renumber the document ids in
that case?
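
As a rough sketch of what such an offset scheme could look like (this is
hypothetical, nothing like it is implemented, and the names are made up),
each database's offset would be the sum of the last document ids of the
databases before it:

```python
from itertools import accumulate


def offsets(last_docids):
    """Offset of each database = sum of the last docids of those before it."""
    return [0] + list(accumulate(last_docids))[:-1]


def combined_docid(sub_docid, db_index, offs):
    """Combined docid under the hypothetical concatenating scheme."""
    return offs[db_index] + sub_docid


# db1's last docid is 4, db2's is 3.
before = offsets([4, 3])
# Two documents are then added to db1 (not the last database) and the
# combined database is reopened, so the offsets are recomputed.
after = offsets([6, 3])
```

The first document of db2 is combined id 5 under `before` but 7 under
`after`, which is exactly the renumbering on reopen mentioned above.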

Hmm, actually I see a neat hack.  If you add the first document to db2
with a document id at least one more than the last document id of db1
then the merged document ids will preserve the order within each db
but put all the documents in db1 before those in db2.  Currently
xapian-compact preserves spans of unused document ids at the start and
end of the database, but that would be easy to fix.
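
Here is a small simulation of the hack, assuming the round-robin
interleaving for two databases (the helper name and the sample ids are
invented for illustration).  In a real indexer you would presumably read
the last docid of db1 with get_lastdocid() and add the first document to
db2 with replace_document() at an explicit docid, but that part is not
shown here.

```python
def combined_docid(sub_docid, db_index, n_dbs):
    """Round-robin mapping from (sub_docid, 0-based db_index) to combined docid."""
    return (sub_docid - 1) * n_dbs + db_index + 1


db1_docids = [1, 2, 3, 4]
last_db1 = max(db1_docids)

# Start db2's numbering one past db1's last docid instead of at 1.
db2_docids = [last_db1 + 1 + i for i in range(3)]

merged = sorted(
    [(combined_docid(d, 0, 2), "db1", d) for d in db1_docids]
    + [(combined_docid(d, 1, 2), "db2", d) for d in db2_docids]
)
order = [name for _, name, _ in merged]
```

Sorting by combined docid puts every db1 document ahead of every db2
document, and the order within each database is unchanged.  The cost is
that the combined ids for db2 are roughly twice as large as they would
be under plain concatenation.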

Cheers,
    Olly


