[Xapian-tickets] [Xapian] #326: Searches with chert are slow, due to slow doclen access

Thu Feb 5 18:19:42 GMT 2009

#326: Searches with chert are slow, due to slow doclen access
---------------------------+------------------------------------------------
 Reporter:  richard        |        Owner:  richard  
     Type:  defect         |       Status:  assigned 
 Priority:  normal         |    Milestone:  1.1.0    
Component:  Backend-Chert  |      Version:  SVN trunk
 Severity:  normal         |   Resolution:           
 Keywords:                 |    Blockedby:           
 Platform:  All            |     Blocking:           
---------------------------+------------------------------------------------
Description changed by richard:

Old description:

> I've built a benchmark database of slightly over 100,000 documents from
> wikipedia, and indexed these with flint and chert.  When searching the
> resulting databases, with 10,000 single term searches, with the database
> fully cached, the flint database completes all searches in 1.78 seconds,
> whereas the chert database completes all searches in 12.58 seconds - ie,
> chert is about 7 times slower than flint.
>
> Note that the chert database is considerably smaller than the flint
> database, which hopefully means that in the uncached case chert might
> perform better.  However, with databases under 1m documents, we're likely
> to be IO bound, and performance will be much worse with chert.
>
> Profiling the code with callgrind revealed that around 85% of the CPU
> time is being spent in get_doclength() calls, and around 90% of that time
> is spent in ChertPostList::move_forward_in_chunk_to_at_least().  This
> method calls next_in_chunk() repeatedly to find the appropriate doclen:
> on average, it calls next_in_chunk() around 30 times per call.
>
> I don't think this degree of slowdown is acceptable, so we need to either
> find a way to make the code faster with the existing datastructure, or
> find a way to allow faster seeking in the doclen list.
>
> I've done some experiments about this, which I'll detail in subsequent
> comments.

New description:

 I've built a benchmark database of slightly over 100,000 documents from
 wikipedia, and indexed these with flint and chert.  When searching the
 resulting databases, with 10,000 single term searches, with the database
 fully cached, the flint database completes all searches in 1.78 seconds,
 whereas the chert database completes all searches in 12.58 seconds - ie,
 chert is about 7 times slower than flint.

 Note that the chert database is considerably smaller than the flint
 database, which hopefully means that in the uncached case chert might
 perform better.  However, with databases under 1m documents, we're likely
 to be IO bound, and performance will be much worse with chert.

 Profiling the code with callgrind revealed that around 85% of the CPU time
 is being spent in get_doclength() calls, and around 90% of that time is
 spent in ChertPostList::move_forward_in_chunk_to_at_least().  This method
 calls next_in_chunk() repeatedly to find the appropriate doclen: on
 average, it calls next_in_chunk() around 30 times per call.

 I don't think this degree of slowdown is acceptable, so we need to either
 find a way to make the code faster with the existing datastructure, or
 find a way to allow faster seeking in the doclen list.

 I've done some experiments about this, which are detailed in the comments.
 So far, a combination of avoid_string_operations.patch and
 optimise_unpack_uint.patch give the best result, but still result in
 searches 4.5 times slower than with flint.  I think a redesign of the way
 doclen information is stored is needed to get acceptable speeds.

--

-- 
Ticket URL: <http://trac.xapian.org/ticket/326#comment:4>
Xapian <http://xapian.org/>
Xapian