[Xapian-tickets] [Xapian] #385: Expanding docids (etc) beyond 32 bit types
Xapian
nobody at xapian.org
Thu Jun 25 12:17:48 BST 2009
#385: Expanding docids (etc) beyond 32 bit types
-------------------------+--------------------------------------------------
Reporter: james | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Other | Version: SVN trunk
Severity: minor | Keywords:
Blockedby: | Platform: All
Blocking: |
-------------------------+--------------------------------------------------
Changes (by olly):
* version: => SVN trunk
Comment:
Replying to the mailing list message here to keep all the discussion
together.
> I'm in work-avoidance mode at the moment, so I thought I'd take a pass
> at this. The simplest approach seemed to be to declare
> Xapian::unsigned_integer and Xapian::signed_integer, and use them for
> doccount, doccount_diff, docid, termcount, termcount_diff, termpos,
> termpos_diff, valueno and valueno_diff, the last two mostly because I
> couldn't be bothered to figure out why there's a piece of code that
> iterates over valuenos but wants them to act like docids, or vice
> versa or something.
I'm not keen on adding new public types for this. They aren't useful in
themselves, and these types don't all need to be the same size - they're
all just "int" currently as that's "big enough" (or at least was for most
people) on all modern platforms - so a common type for them doesn't really
make logical sense either.
If valueno needs changing, that's a bug. Ideally termcount shouldn't need
changing (more than 4 billion terms per document doesn't seem like a sane
scenario, and isn't going to work sanely with the current termlist storage
anyway), but we would need a new type for collection frequency. We probably
should have one anyway since the collection frequency of a term which
occurs many times in many documents will for many users probably overflow
32 bits before you add 4 billion documents.
> Out of the box, with two static_cast<>s (to turn 0u and 1u into our
> unsigned_integer type instead of just unsigned int)
Both arguments of {{{std::min()}}} and {{{std::max()}}} should be the same
type, so the cast should be to that type, not to whatever it happens to be
typedef-ed to currently, so 1u is wrong even as things stand. I've fixed
that in trunk.
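A minimal sketch of what I mean, assuming a stand-in {{{doccount}}} typedef and a hypothetical helper {{{clamp_count()}}} (neither is the actual trunk code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stand-in for a widened doccount (assumption: the real typedef is
// currently a 32 bit unsigned type).
typedef std::uint64_t doccount;

// Hypothetical helper: ask for at least one result, but no more than are
// available.
inline doccount clamp_count(doccount wanted, doccount available) {
    // std::max(wanted, 1u) would fail to compile here, because template
    // argument deduction sees two different types. Casting to doccount
    // keeps working whatever doccount is typedef-ed to.
    doccount at_least_one = std::max(wanted, static_cast<doccount>(1));
    return std::min(at_least_one, available);
}
```

The cast names the type of the other argument, so the code doesn't need touching again if the typedef changes.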
> this passes all
> tests except stub on all backends except remote. I believe that stub
> actually uses remote anyway,
A stub database can refer to any database backend(s), and there are several
stub tests which are run over various actual backends, but if the remote
tests fail, then stub tests run over the remote backend are likely to as
well!
> so this just means that the network
> protocol can't handle 64 bit, which was expected.
The remote protocol uses variable length integer encodings produced by
templated functions, so I'd actually expect it would just work. Hard to
guess what might be wrong. Running xapian-tcpsrv by hand in one terminal
and performing a search on it from another (e.g. via a stub db file and
examples/quest) might show what's going on.
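To illustrate why I'd expect a templated variable length encoding to cope with wider types, here's a rough sketch. The function names and the byte layout (7 bits per byte, low group first) are assumptions made for illustration, not the actual remote protocol encoding:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical encoder for unsigned types: 7 bits per byte, least
// significant group first, top bit set on every byte except the last.
template<class T>
std::string encode_varint(T value) {
    std::string out;
    while (value >= 128) {
        out += static_cast<char>(static_cast<unsigned char>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out += static_cast<char>(static_cast<unsigned char>(value));
    return out;
}

// Hypothetical decoder: returns false if the input is truncated or the
// value doesn't fit in T.
template<class T>
bool decode_varint(const std::string& in, T& result) {
    result = 0;
    unsigned shift = 0;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(in[i]);
        T bits = static_cast<T>(byte & 0x7f);
        // Reject groups which would be shifted (partly) out of T.
        if (shift >= sizeof(T) * 8 || ((bits << shift) >> shift) != bits)
            return false;
        result |= bits << shift;
        if (!(byte & 0x80)) return true;
        shift += 7;
    }
    return false;
}
```

In this scheme the bytes on the wire depend only on the value, not on the width of the type it was held in, so a small value encodes identically from a 32 bit or 64 bit variable, and a 32 bit decoder fails cleanly on a value which is too large rather than silently truncating it.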
I should note that passing the testsuite wouldn't actually be saying much
about this patch - it would really need testing with more than 4 billion
documents. The testsuite can't sanely do that due to the time and space
required, but it can test with really large docids, and I guess people
wanting this support can report any issues they hit.
> (In a future where we support this, it's unclear as yet to me what the
> right approach is. 64 bits will slow down access in some cases, but on
> some CPUs other aspects of 64 bit access will be faster. I guess we
> just have to profile a lot and perhaps have it as a configure
> option until/unless it's the clear winner. Shudder.)
It's always going to be slower on a CPU without 64 bit registers, and sadly
it's not going to be ABI compatible in general.
Right now, the sanest approach is probably just for people who actually
need it to enable it - if you're handling more than 4 billion documents,
having to work with a specially built package isn't likely to be a huge
deal.
--
Ticket URL: <http://trac.xapian.org/ticket/385#comment:2>
Xapian <http://xapian.org/>