[Xapian-discuss] Indexing more than 15 billion documents

James Aylett james-xapian at tartarus.org
Wed Jun 24 14:44:47 BST 2009


On Wed, Jun 24, 2009 at 03:03:02AM +0100, Olly Betts wrote:

> > Sorry to follow up on an old thread, but I am wondering if there has
> > been any work done on, or interest in, increasing the maximum document
> > id beyond a 32bit limit?
> 
> Neither.
> 
> If you change types like Xapian::doccount to be 64 bit, it might just
> work - if not, it shouldn't be much work to patch up the places which
> make 32 bit assumptions.  The low level encodings are templated so will
> handle any size type.

I'm in work-avoidance mode at the moment, so I thought I'd take a pass
at this. The simplest approach seemed to be to declare
Xapian::unsigned_integer and Xapian::signed_integer, and use them for
doccount, doccount_diff, docid, termcount, termcount_diff, termpos,
termpos_diff, valueno and valueno_diff. I included the last two mostly
because I couldn't be bothered to figure out why there's a piece of
code that iterates over valuenos but wants them to act like docids, or
vice versa or something.
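
As a rough sketch of the typedef approach described above: the
namespace name `sketch` is hypothetical, and the choice of which types
are signed (the `_diff` ones) is an assumption rather than something
taken from the patch on the ticket.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical stand-in for the Xapian namespace

// The two base types; widening everything means changing only these.
typedef std::uint64_t unsigned_integer;
typedef std::int64_t signed_integer;

// Assumed mapping: counts, ids and positions unsigned, differences signed.
typedef unsigned_integer doccount;
typedef signed_integer doccount_diff;
typedef unsigned_integer docid;
typedef unsigned_integer termcount;
typedef signed_integer termcount_diff;
typedef unsigned_integer termpos;
typedef signed_integer termpos_diff;
typedef unsigned_integer valueno;
typedef signed_integer valueno_diff;

// Compile-time check that a docid can now exceed the 32 bit range.
static_assert(sizeof(docid) == 8, "docid should be 64 bit in this sketch");

}  // namespace sketch
```

With this in place, a value such as 15000000000 fits in `sketch::docid`
without wrapping.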

Out of the box, with two static_cast<>s (to turn 0u and 1u into our
unsigned_integer type instead of plain unsigned int), this passes all
tests on every backend except remote, apart from the stub test. I
believe that stub actually uses remote anyway, so this just means that
the network protocol can't handle 64 bit, which was expected.
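
One plausible reason such casts become necessary (an assumption; the
mail does not say where the two casts went) is template argument
deduction: a call like `std::max(n, 1u)` stops compiling once `n` is
wider than `unsigned int`, because both arguments must deduce to the
same type. The helper name `at_least_one` below is made up for
illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

typedef std::uint64_t unsigned_integer;  // hypothetical 64 bit base type

// Clamp a count to a minimum of 1.
inline unsigned_integer at_least_one(unsigned_integer n) {
    // std::max(n, 1u) would fail to compile here: T would be deduced as
    // both std::uint64_t and unsigned int. Casting the literal fixes it.
    return std::max(n, static_cast<unsigned_integer>(1u));
}
```

For example, `at_least_one(0)` yields 1, and a value above the 32 bit
range passes through unchanged.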

This is on Mac OS X, which /probably/ means that most Unixoids will
behave similarly. This is actually very good news; last time I tried
this (several years ago) almost everything blew up when I tried to
switch over to 64 bit types.

The error is:

  NetworkError: Received EOF (context: remote:tcp(127.0.0.1:1239))

(or remote:prog(...) for the stub failure, etc.)

I assume this is the protocol blowing up.
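
Whatever the actual cause of the EOF, the point about templated
low-level encodings can be illustrated with a generic variable-length
integer codec. This is a minimal sketch of a common 7-bits-per-byte
scheme, not Xapian's real wire format; `encode_uint` and `decode_uint`
are hypothetical names. It shows how a value written from a 64 bit
type is rejected by a reader still using a 32 bit type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Append n to out, 7 bits per byte, least significant group first.
// The high bit of each byte means "more bytes follow".
template <typename T>
void encode_uint(std::string& out, T n) {
    static_assert(std::is_unsigned<T>::value, "unsigned types only");
    while (n >= 0x80) {
        out += static_cast<char>(static_cast<unsigned char>((n & 0x7f) | 0x80));
        n >>= 7;
    }
    out += static_cast<char>(static_cast<unsigned char>(n));
}

// Decode one value starting at pos. Returns false if the input is
// truncated or the value does not fit in T.
template <typename T>
bool decode_uint(const std::string& in, std::size_t& pos, T& result) {
    static_assert(std::is_unsigned<T>::value, "unsigned types only");
    result = 0;
    unsigned shift = 0;
    while (pos < in.size()) {
        unsigned char byte = static_cast<unsigned char>(in[pos++]);
        T chunk = static_cast<T>(byte & 0x7f);
        if (shift >= sizeof(T) * 8) return false;  // too many bytes for T
        T shifted = static_cast<T>(chunk << shift);
        if (static_cast<T>(shifted >> shift) != chunk) return false;  // bits lost
        result = static_cast<T>(result | shifted);
        if (!(byte & 0x80)) return true;  // last byte of this value
        shift += 7;
    }
    return false;  // ran out of input mid-value
}
```

Encoding 15000000000 as `std::uint64_t` and decoding it into a
`std::uint32_t` makes `decode_uint` return false, while decoding into a
`std::uint64_t` round-trips the value.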

I've created ticket:385 <http://trac.xapian.org/ticket/385> to track
this, and attached my patch to it. (I'm certain there are problems
with the patch as it stands ;-)

(In a future where we support this, it's not yet clear to me what the
right approach is. 64 bits will slow down access in some cases, but on
some CPUs other aspects of 64 bit access will be faster. I guess we
just have to profile a lot and perhaps have it as a configure
option until/unless it's the clear winner. Shudder.)
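
A configure option of that kind could come down to a single
preprocessor switch. In this sketch the macro name
`SKETCH_ENABLE_64BIT_DOCID`, the namespace `cfg_sketch` and the helper
`fits_in_docid` are all hypothetical; nothing like them is claimed to
exist in Xapian.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

namespace cfg_sketch {

// A configure script might define this macro; the default stays 32 bit.
#ifdef SKETCH_ENABLE_64BIT_DOCID
typedef std::uint64_t unsigned_integer;
#else
typedef std::uint32_t unsigned_integer;
#endif

typedef unsigned_integer docid;

// True if n can be represented as a docid in this build.
constexpr bool fits_in_docid(std::uint64_t n) {
    return n <= std::numeric_limits<docid>::max();
}

}  // namespace cfg_sketch
```

In the default (32 bit) build, `fits_in_docid(15000000000ULL)` is false;
building with the macro defined makes it true, which is what would let
the two variants be profiled against each other.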

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
