[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)

Olly Betts olly at survex.com
Mon Mar 12 13:12:01 GMT 2012


On Fri, Mar 09, 2012 at 02:39:48PM -0900, Shane Spencer wrote:
> I apologize for what may be a sore subject.  4 billion documents is a
> heck of a lot.  64 bit vs 32 bit would be an incredibly large database
> with an average document and term size.  Why 128 bit?  Simply for
> address space.
> 
> Mapping a UUID (128 bit) or MongoDB ObjectID (96 bit) directly into
> the Xapian document space removes the need for referencing one or the
> other from one or both.  I see a common tendency to write a document
> to Xapian, get the document ID back, and then write that ID to the
> database backing the document in some way.

As James notes, you can store the ID as a term in Xapian.

> This is nothing new, but I really would like to remove that extra
> write and optionally throw away the Xapian response by specifying the
> document ID as the UUID associated with the document.  This is starting
> to become much more important as people are walking away from
> auto-increment fields and aiming more toward universal identification
> which, from a sparseness standpoint, is amazingly wasteful but
> incredibly useful.
> 
> Thanks for your consideration.  I have no idea how complicated it
> would be to make this change to Xapian, however I'd imagine migrating
> the document ID into a binary-like value rather than an integer value
> would allow for very large document ID widths.  This probably means
> adding a 16 bit length to every document ID which is pretty wasteful.

You're making incorrect assumptions about how Xapian stores document
IDs.  They're stored as variable length integers, and the encoding
naturally extends to any size of integer.
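
To illustrate why a variable-length encoding needs no per-ID length
field, here is a minimal sketch of one common scheme (7 data bits per
byte, with the high bit marking "more bytes follow").  It is an
assumption for illustration only, not necessarily the exact byte
format Xapian uses, and encode_varint/decode_varint are hypothetical
names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not Xapian's API): append `value` to `out`,
// least significant 7-bit group first.  Every byte except the last
// has its high bit set, so small values take a single byte.
inline void encode_varint(std::uint64_t value, std::string& out) {
    while (value >= 0x80) {
        out += static_cast<char>((value & 0x7f) | 0x80);
        value >>= 7;
    }
    out += static_cast<char>(value);
}

// Decode one integer starting at `pos`, advancing `pos` past it.
inline std::uint64_t decode_varint(const std::string& in, std::size_t& pos) {
    std::uint64_t value = 0;
    int shift = 0;
    while (true) {
        // Guard against truncated input and over-long encodings.
        assert(pos < in.size() && shift < 64);
        unsigned char byte = static_cast<unsigned char>(in[pos++]);
        value |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) return value;
        shift += 7;
    }
}
```

In this sketch the loop does not depend on the integer width except
through the type, so a wider type only means the loop can run for
more iterations.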

At least conceptually, it's fairly easy to make the change you are
suggesting.  People have looked at making the change to 64 bit docids:

http://trac.xapian.org/ticket/385

It's mostly just a matter of changing the type used to "long long",
but assumptions creep in so there are probably a few other fixes needed.

Changing to 128-bit docids isn't much harder.  Most platforms don't
have a 128-bit integer type, but you can make one with a C++ class
and operator overloading.  Then just plug that in instead of
"long long" (and probably fix a few assumptions).  The only limitation
I can see is that this reduces the maximum term length a bit (since we
need to build Btree keys from a term and a docid, so if the docid can
be wider, the term can't be quite as long).
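
As a rough sketch of what such a class might look like, the following
builds a 128-bit unsigned value from two 64-bit halves and overloads a
few operators.  DocId128 and to_uint64 are hypothetical names, and
this is nowhere near the full set of operators a drop-in replacement
for an integer docid type would need:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit unsigned integer made of two 64-bit halves.
struct DocId128 {
    std::uint64_t hi;  // most significant 64 bits
    std::uint64_t lo;  // least significant 64 bits
};

inline bool operator==(const DocId128& a, const DocId128& b) {
    return a.hi == b.hi && a.lo == b.lo;
}

inline bool operator<(const DocId128& a, const DocId128& b) {
    return a.hi < b.hi || (a.hi == b.hi && a.lo < b.lo);
}

// Addition: unsigned overflow of the low half carries into the high half.
inline DocId128 operator+(const DocId128& a, const DocId128& b) {
    DocId128 r{a.hi + b.hi, a.lo + b.lo};
    if (r.lo < a.lo) ++r.hi;
    return r;
}

// Subtraction: borrow from the high half when the low half underflows.
inline DocId128 operator-(const DocId128& a, const DocId128& b) {
    DocId128 r{a.hi - b.hi, a.lo - b.lo};
    if (a.lo < b.lo) --r.hi;
    return r;
}

// Narrowing is one place where assumptions can creep in: code that
// expects a docid to fit in 64 bits has to check explicitly.
inline std::uint64_t to_uint64(const DocId128& v) {
    assert(v.hi == 0);
    return v.lo;
}
```

Subtraction matters here because gaps between consecutive docids are
what get stored in many places.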

However, Xapian stores deltas between document ids a lot, and if you
create this ultra-sparse space of document ids, these deltas will
tend to be billions rather than being small integers.  That means
everything takes more space to store - probably much more than it
would take to just store each document's UUID as a term.
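
The effect of sparseness on delta storage can be sketched with a
little arithmetic.  The code below assumes a generic 7-bits-per-byte
variable-length encoding (an illustration, not necessarily Xapian's
exact format) and uses 64-bit IDs to keep it short; varint_length and
delta_encoded_size are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store `value` at 7 data bits per byte.
inline std::size_t varint_length(std::uint64_t value) {
    std::size_t n = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++n;
    }
    return n;
}

// Total bytes to store a sorted ID list as the first ID followed by
// the gap between each ID and its predecessor.
inline std::size_t delta_encoded_size(const std::vector<std::uint64_t>& sorted_ids) {
    std::size_t total = 0;
    std::uint64_t previous = 0;
    for (std::uint64_t id : sorted_ids) {
        assert(id >= previous);  // input must be sorted ascending
        total += varint_length(id - previous);
        previous = id;
    }
    return total;
}
```

Under these assumptions the dense list {1, 2, 3, 4} costs 4 bytes,
while three IDs spaced billions apart, such as {1000000000,
5000000000, 9000000000}, cost 15 bytes.  Gaps between random 128-bit
UUIDs would be far larger still.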

Cheers,
    Olly
