[Xapian-tickets] [Xapian] #326: Searches with chert are slow, due to slow doclen access
Xapian
nobody at xapian.org
Thu Mar 19 00:30:18 GMT 2009
#326: Searches with chert are slow, due to slow doclen access
---------------------------+------------------------------------------------
Reporter: richard | Owner: richard
Type: defect | Status: assigned
Priority: normal | Milestone: 1.1.1
Component: Backend-Chert | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
---------------------------+------------------------------------------------
Comment(by olly):
Had an idea last night...
If the doclength encoding is a known fixed length per document, then skip
ahead becomes O(1), and move_forward_in_chunk_to_at_least() is just a
multiply and an add instruction.
There's some extra space lost to encoding lengths in more bytes than they
need, but we also save the overhead of keeping skip lists and it's much
easier to update this structure. Also, we only have one list of
doclengths with one entry per document so compactness is less critical
than it is for posting lists.
At its simplest level, the idea would have an initial byte of the chunk to
give the fixed width - obviously 1, 2, 3, 4 byte are useful. I'd guess it
might be worth considering 12 bit, etc. And then we encode lengths like
that until we get one which is too long. Updating may require a slightly
fiddly split though.
Slightly more complex version:
Header byte = 2 bits for width (1,2,3,4 byte), 6 bits for number of
entries of that size - 1
Then we have that many entries of the specified width
Appending requires checking the encoding, and if the width is OK and the
count is less than 32, incrementing the count and appending the new
length. If not, we append a new header byte and start encoding. We can
easily track min, max, and/or average length of the data being appended to
judge if we should reduce the encoding width when possible too.
Searching can step forward by up to 32 entries very cheaply. If that
doesn't seem enough, we could use a 2 byte header, and potentially add
some more widths. We could also keep a min length there and store deltas
from it (e.g. 5 bits could store log2 of a width lower bound; 2 or 3 bits
for the field width, and 9 or 8 bit count).
I'm going to dump out the gmane document lengths and have a look at how
they lie, but perhaps not until 1.1.0 is done.
--
Ticket URL: <http://trac.xapian.org/ticket/326#comment:12>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list