[Xapian-discuss] 128 bit Document IDs (Please don't hurt me)

Shane Spencer shane at bogomip.com
Fri Mar 9 23:39:48 GMT 2012


I apologize for what may be a sore subject.  4 billion documents is a
heck of a lot, and a database that needed 64 bit rather than 32 bit
document IDs would be incredibly large at an average document and term
size.  So why 128 bit?  Simply for address space.

Mapping a UUID (128 bit) or MongoDB ObjectID (96 bit) directly into
the Xapian document ID space removes the need to store a reference to
one ID alongside the other.  I see a common tendency to write a
document to Xapian, get the document ID back, and then write that ID
to the database backing the document in some way.
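
To make that pattern concrete, here is a minimal sketch using only the
Python standard library.  FakeIndex is a hypothetical stand-in for an
index that hands out its own integer document IDs (it is not the
Xapian API), and backing_store is a plain dict standing in for the
database.

```python
import itertools
import uuid


class FakeIndex:
    """Hypothetical stand-in for an index that assigns its own integer IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.docs = {}

    def add_document(self, text):
        # The index, not the caller, decides the document ID.
        docid = next(self._next_id)
        self.docs[docid] = text
        return docid


index = FakeIndex()
backing_store = {}  # stand-in for the database, keyed by UUID

doc_uuid = uuid.uuid4()
backing_store[doc_uuid] = {"body": "hello world"}   # write to the database
docid = index.add_document("hello world")           # write to the index
backing_store[doc_uuid]["index_docid"] = docid      # the extra write-back
```

If the index accepted the UUID itself as the document ID, the last
line would not be needed.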

This is nothing new, but I would really like to remove that extra
write and optionally throw away the Xapian response by specifying the
UUID associated with the document as its document ID.  This is
starting to become much more important as people are walking away
from auto-increment fields and moving toward universal identifiers,
which, from a sparseness standpoint, are amazingly wasteful but
incredibly useful.

Thanks for your consideration.  I have no idea how complicated it
would be to make this change to Xapian, but I'd imagine that turning
the document ID into a binary-like value rather than an integer value
would allow for very large document ID widths.  This probably means
adding a 16 bit length to every document ID, which is pretty wasteful.
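
As a rough illustration of that overhead, the sketch below prefixes
each ID with a 16 bit length field.  This layout is only my reading of
"a 16 bit length"; it is an assumption and not anything Xapian does,
and encode_id and decode_id are hypothetical helper names.

```python
import struct
import uuid


def encode_id(raw: bytes) -> bytes:
    """Prefix the raw ID bytes with a 16 bit big-endian length field."""
    return struct.pack(">H", len(raw)) + raw


def decode_id(buf: bytes) -> bytes:
    """Read the 16 bit length field and return that many ID bytes."""
    (length,) = struct.unpack_from(">H", buf, 0)
    return buf[2:2 + length]


uuid_raw = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes  # 16 bytes
small_raw = (42).to_bytes(4, "big")                                 # 4 bytes

uuid_encoded = encode_id(uuid_raw)    # 2 + 16 bytes
small_encoded = encode_id(small_raw)  # 2 + 4 bytes
```

The two extra bytes are 12.5% overhead on a 16 byte UUID, but 50% on a
4 byte integer ID, which is where the waste would show up.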

For now I'm just storing the UUID as a serialized large integer
through python-xapian and then writing the Xapian document ID to my
database documents as they become indexed.
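
For illustration, one way to treat a UUID as a large integer and
serialize it, using only the Python standard library.  The exact
serialization used above is not shown, so the 16 byte big-endian
layout and the helper names here are assumptions.

```python
import uuid


def uuid_to_bytes(u: uuid.UUID) -> bytes:
    """Serialize the UUID's 128 bit integer value as 16 big-endian bytes."""
    return u.int.to_bytes(16, "big")


def bytes_to_uuid(b: bytes) -> uuid.UUID:
    """Rebuild the UUID from its 16 byte big-endian integer form."""
    return uuid.UUID(int=int.from_bytes(b, "big"))


doc_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
packed = uuid_to_bytes(doc_uuid)

# A value like this cannot be used directly as a 32 bit document ID.
fits_in_32_bits = doc_uuid.int < 2**32
```

Big-endian byte order keeps the byte-wise sort order of the serialized
values the same as the numeric order of the integers.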

Thanks for your consideration,

Shane Spencer


