[Xapian-tickets] [Xapian] #326: Change doc length chunk encoding so skipping through a chunk is better than O(n)

Sat Mar 24 13:05:52 GMT 2012

#326: Change doc length chunk encoding so skipping through a chunk is better than
O(n)
---------------------------+------------------------------------------------
 Reporter:  richard        |       Owner:  olly     
     Type:  defect         |      Status:  assigned 
 Priority:  normal         |   Milestone:  1.2.x    
Component:  Backend-Brass  |     Version:  SVN trunk
 Severity:  normal         |    Keywords:           
Blockedby:                 |    Platform:  All      
 Blocking:                 |  
---------------------------+------------------------------------------------
Description changed by olly:

Old description:

> I've built a benchmark database of slightly over 100,000 documents from
> wikipedia, and indexed these with flint and chert.  When searching the
> resulting databases, with 10,000 single term searches, with the database
> fully cached, the flint database completes all searches in 1.78 seconds,
> whereas the chert database completes all searches in 12.58 seconds - ie,
> chert is about 7 times slower than flint.
>
> Note that the chert database is considerably smaller than the flint
> database, which hopefully means that in the uncached case chert might
> perform better.  However, with databases under 1m documents, we're likely
> to be IO bound, and performance will be much worse with chert.
>
> Profiling the code with callgrind revealed that around 85% of the CPU
> time is being spent in get_doclength() calls, and around 90% of that time
> is spent in ChertPostList::move_forward_in_chunk_to_at_least().  This
> method calls next_in_chunk() repeatedly to find the appropriate doclen:
> on average, it calls next_in_chunk() around 30 times per call.
>
> I don't think this degree of slowdown is acceptable, so we need to either
> find a way to make the code faster with the existing datastructure, or
> find a way to allow faster seeking in the doclen list.
>
> I've done some experiments about this, which are detailed in the
> comments.  So far, a combination of avoid_string_operations.patch and
> optimise_unpack_uint.patch give the best result, but still result in
> searches 4.5 times slower than with flint.  I think a redesign of the way
> doclen information is stored is needed to get acceptable speeds.

New description:

 I've built a benchmark database of slightly over 100,000 documents from
 wikipedia, and indexed these with flint and chert.  When searching the
 resulting databases, with 10,000 single term searches, with the database
 fully cached, the flint database completes all searches in 1.78 seconds
 (averaging 0.000178 seconds/query), whereas the chert database completes
 all searches in 12.58 seconds (averaging 0.001258 seconds/query) - ie,
 chert is about 7 times slower than flint.

 Note that the chert database is considerably smaller than the flint
 database, which hopefully means that in the uncached case chert might
 perform better.  However, with databases under 1m documents, we're likely
 to be IO bound, and performance will be much worse with chert.

 Profiling the code with callgrind revealed that around 85% of the CPU time
 is being spent in get_doclength() calls, and around 90% of that time is
 spent in ChertPostList::move_forward_in_chunk_to_at_least().  This method
 calls next_in_chunk() repeatedly to find the appropriate doclen: on
 average, it calls next_in_chunk() around 30 times per call.

 I don't think this degree of slowdown is acceptable, so we need to either
 find a way to make the code faster with the existing datastructure, or
 find a way to allow faster seeking in the doclen list.

 I've done some experiments about this, which are detailed in the comments.
 So far, a combination of avoid_string_operations.patch and
 optimise_unpack_uint.patch give the best result, but still result in
 searches 4.5 times slower than with flint.  I think a redesign of the way
 doclen information is stored is needed to get acceptable speeds.

--

-- 
Ticket URL: <http://trac.xapian.org/ticket/326#comment:20>
Xapian <http://xapian.org/>
Xapian