[Xapian-tickets] [Xapian] #326: Searches with chert are slow, due to slow doclen access

Thu Mar 19 00:30:18 GMT 2009

#326: Searches with chert are slow, due to slow doclen access
---------------------------+------------------------------------------------
 Reporter:  richard        |        Owner:  richard  
     Type:  defect         |       Status:  assigned 
 Priority:  normal         |    Milestone:  1.1.1    
Component:  Backend-Chert  |      Version:  SVN trunk
 Severity:  normal         |   Resolution:           
 Keywords:                 |    Blockedby:           
 Platform:  All            |     Blocking:           
---------------------------+------------------------------------------------

Comment(by olly):

 Had an idea last night...

 If the doclength encoding is a known fixed length per document, then skip
 ahead becomes O(1), and move_forward_in_chunk_to_at_least() is just a
 multiply and an add instruction.

 There's some extra space lost to encoding lengths in more bytes than they
 need, but we also save the overhead of keeping skip lists and it's much
 easier to update this structure.  Also, we only have one list of
 doclengths with one entry per document so compactness is less critical
 than it is for posting lists.

 At its simplest level, the idea would have an initial byte of the chunk to
 give the fixed width - obviously 1, 2, 3, 4 byte are useful.  I'd guess it
 might be worth considering 12 bit, etc.  And then we encode lengths like
 that until we get one which is too long.  Updating may require a slightly
 fiddly split though.

 Slightly more complex version:

 Header byte = 2 bits for width (1,2,3,4 byte), 6 bits for number of
 entries of that size - 1

 Then we have that many entries of the specified width

 Appending requires checking the encoding, and if the width is OK and the
 count is less than 32, incrementing the count and appending the new
 length.  If not, we append a new header byte and start encoding.  We can
 easily track min, max, and/or average length of the data being appended to
 judge if we should reduce the encoding width when possible too.

 Searching can step forward by up to 32 entries very cheaply.  If that
 doesn't seem enough, we could use a 2 byte header, and potentially add
 some more widths.  We could also keep a min length there and store deltas
 from it (e.g. 5 bits could store log2 of a width lower bound; 2 or 3 bits
 for the field width, and 9 or 8 bit count).

 I'm going to dump out the gmane document lengths and have a look at how
 they lie, but perhaps not until 1.1.0 is done.

-- 
Ticket URL: <http://trac.xapian.org/ticket/326#comment:12>
Xapian <http://xapian.org/>
Xapian