[Xapian-tickets] [Xapian] #385: Expanding docids (etc) beyond 32 bit types

Xapian nobody at xapian.org
Thu Jun 25 12:17:48 BST 2009


#385: Expanding docids (etc) beyond 32 bit types
-------------------------+--------------------------------------------------
 Reporter:  james        |       Owner:  olly     
     Type:  enhancement  |      Status:  new      
 Priority:  normal       |   Milestone:           
Component:  Other        |     Version:  SVN trunk
 Severity:  minor        |    Keywords:           
Blockedby:               |    Platform:  All      
 Blocking:               |  
-------------------------+--------------------------------------------------
Changes (by olly):

  * version:  => SVN trunk


Comment:

 Replying to the mailing list message here to keep all the discussion
 together.

 > I'm in work-avoidance mode at the moment, so I thought I'd take a pass
 > at this. The simplest approach seemed to be to declare
 > Xapian::unsigned_integer and Xapian::signed_integer, and use them for
 > doccount, doccount_diff, docid, termcount, termcount_diff, termpos,
 > termpos_diff, valueno and valueno_diff, the last two mostly because I
 > couldn't be bothered to figure out why there's a piece of code that
 > iterates over valuenos but wants them to act like docids, or vice
 > versa or something.

 I'm not keen on adding new public types for this.  They aren't useful in
 themselves, and these types don't all need to be the same size - they're
 just all "int" currently as that's "big enough" (or at least was for
 most people) on all modern platforms - so a common type for them doesn't
 really make logical sense either.

 If valueno needs changing, that's a bug.  Ideally termcount shouldn't need
 changing (more than 4 billion terms per document doesn't seem like a sane
 scenario, and isn't going to work sanely with the current termlist storage
 anyway), but we would need a new type for collection frequency.  We
 probably should have one anyway since the collection frequency of a term
 which occurs many times in many documents will for many users probably
 overflow 32 bits before you add 4 billion documents.
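
 To put rough numbers on the collection frequency point, here is a small
 sketch.  The helper names and the figures (500 million documents, 10
 occurrences per document) are made up for illustration and aren't taken
 from Xapian or from any real collection.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not Xapian API): the collection frequency of a term
// which occurs `per_doc` times in each of `docs` documents.
constexpr std::uint64_t collection_freq(std::uint64_t docs,
                                        std::uint64_t per_doc) {
    return docs * per_doc;
}

// True if `value` can't be represented in a 32 bit unsigned type.
constexpr bool overflows_32bit(std::uint64_t value) {
    return value > std::numeric_limits<std::uint32_t>::max();
}

// What a 32 bit counter would actually hold: the value modulo 2^32.
constexpr std::uint32_t as_32bit(std::uint64_t value) {
    return static_cast<std::uint32_t>(value);
}
```

 With those figures the true total is 5 billion, which is past the 32 bit
 limit of 4294967295 even though the document count is far below it, and a
 32 bit counter would silently wrap to 705032704.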

 > Out of the box, with two static_cast<>s (to turn 0u and 1u into our
 > unsigned_integer type instead of just unsigned int)

 Both arguments of {{{std::min()}}} and {{{std::max()}}} should be the same
 type, so the cast should be to that type not whatever it happens to be
 typedef-ed currently, so 1u is wrong even as things stand.  I've fixed
 that in trunk.
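
 A minimal sketch of what I mean, with a stand-in typedef rather than the
 real Xapian one.  The namespace, the typedef and the function name are
 all hypothetical, and this isn't the actual change made in trunk.

```cpp
#include <algorithm>

namespace sketch {

// Stand-in for a typedef like Xapian::doccount.  It is deliberately wider
// than unsigned int here to model the proposed change.
typedef unsigned long long doccount;

// Return n, or 1 if n is zero.
inline doccount at_least_one(doccount n) {
    // std::max(n, 1u) would not compile here: the two arguments have
    // different types, so the template parameter can't be deduced.
    // Casting to the typedef keeps working whatever its width is.
    return std::max(n, static_cast<doccount>(1));
}

}
```

 Writing 1u only compiles while the typedef happens to be unsigned int,
 which is why it is fragile even before any types are widened.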

 > this passes all
 > tests except stub on all backends except remote. I believe that stub
 > actually uses remote anyway,

 A stub database can refer to any database backend(s), and there are
 several stub tests which are run over various actual backends, but if the
 remote tests fail, then stub tests run over the remote backend are likely
 to fail as well!

 > so this just means that the network
 > protocol can't handle 64 bit, which was expected.

 The remote protocol uses variable length integer encodings produced by
 templated functions, so I'd actually expect it would just work.  Hard to
 guess what might be wrong.  Running xapian-tcpsrv by hand in one terminal
 and performing a search on it from another (e.g. via a stub db file and
 examples/quest) might show what's going on.
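
 For anyone unfamiliar with the idea, here is a generic sketch of a
 templated variable length integer encoding (7 bits per byte, low-order
 group first, high bit set on every byte except the last).  This is NOT
 Xapian's actual wire format or code; the function names are hypothetical
 and it only aims to show why the width follows the template type.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Encode an unsigned integer of any width as a sequence of 7 bit groups.
template <typename T>
std::string encode_varint(T value) {
    static_assert(std::is_unsigned<T>::value, "unsigned types only");
    std::string out;
    while (value >= 0x80) {
        out += static_cast<char>((value & 0x7f) | 0x80);  // more bytes follow
        value >>= 7;
    }
    out += static_cast<char>(value);  // final byte has the high bit clear
    return out;
}

// Decode into `result`.  Returns false if the input is empty, truncated,
// has trailing bytes, or holds a value which doesn't fit in T.
template <typename T>
bool decode_varint(const std::string& in, T& result) {
    static_assert(std::is_unsigned<T>::value, "unsigned types only");
    result = 0;
    unsigned shift = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(in[i]);
        if (shift >= sizeof(T) * 8) return false;  // too many groups for T
        T bits = static_cast<T>(byte & 0x7f);
        T shifted = static_cast<T>(bits << shift);
        if (static_cast<T>(shifted >> shift) != bits) return false;  // bits lost
        result |= shifted;
        shift += 7;
        if ((byte & 0x80) == 0) return i + 1 == in.size();
    }
    return false;
}
```

 In this sketch a 64 bit value encodes and decodes without any change to
 the functions, but decoding it into a 32 bit variable is rejected.  A
 width mismatch of that sort between the two ends is one kind of thing to
 look for, though that's only a guess.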

 I should note that passing the testsuite wouldn't actually be saying much
 about this patch - it would really need testing with more than 4 billion
 documents.  The testsuite can't sanely do that due to the time and space
 required, but it can test with really large docids, and I guess people
 wanting this support can report any issues they hit.

 > (In a future where we support this, it's unclear as yet to me what the
 > right approach is. 64 bits will slow down access in some cases, but on
 > some CPUs other aspects of 64 bit access will be faster. I guess we
 > just have to profile a lot and perhaps have it as a configure
 > option until/unless it's the clear winner. Shudder.)

 It's always going to be slower on a CPU without 64 bit registers, and
 sadly it's not going to be ABI compatible in general.
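
 As a rough illustration of the ABI point, consider any struct in a public
 header which embeds a docid.  The struct below is hypothetical and isn't
 a real Xapian class; it just shows the layout changing with the width of
 the type.

```cpp
#include <cstdint>

// Hypothetical stand-in for a public type which stores a docid.
template <typename DocId>
struct PostingSketch {
    DocId did;     // document id
    unsigned wdf;  // within-document frequency
};

typedef PostingSketch<std::uint32_t> Posting32;
typedef PostingSketch<std::uint64_t> Posting64;

// Widening the docid changes the size (and member offsets) of the struct,
// so code compiled against one layout can't safely be linked with a
// library built with the other.
static_assert(sizeof(Posting64) > sizeof(Posting32),
              "widening the docid changes the object layout");
```

 On common C++ ABIs the mangled names of functions taking such a type as a
 parameter change too, so an application built against a 32 bit build
 would typically fail to link against a 64 bit build rather than
 misbehave at run time.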

 Right now, the sanest approach is probably just for people
 who actually need it to enable it - if you're handling more than 4 billion
 documents, having to work with a specially built package isn't likely to
 be a huge deal.

-- 
Ticket URL: <http://trac.xapian.org/ticket/385#comment:2>
Xapian <http://xapian.org/>