[Xapian-tickets] [Xapian] #326: Change doc length chunk encoding so skipping through a chunk is better than O(n)
Xapian
nobody at xapian.org
Sat Mar 24 13:05:52 GMT 2012
#326: Change doc length chunk encoding so skipping through a chunk is better than
O(n)
---------------------------+------------------------------------------------
Reporter: richard | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.2.x
Component: Backend-Brass | Version: SVN trunk
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
---------------------------+------------------------------------------------
Description changed by olly:
Old description:
> I've built a benchmark database of slightly over 100,000 documents from
> wikipedia, and indexed these with flint and chert. When searching the
> resulting databases, with 10,000 single term searches, with the database
> fully cached, the flint database completes all searches in 1.78 seconds,
> whereas the chert database completes all searches in 12.58 seconds - ie,
> chert is about 7 times slower than flint.
>
> Note that the chert database is considerably smaller than the flint
> database, which hopefully means that in the uncached case chert might
> perform better. However, with databases under 1m documents, we're likely
> to be IO bound, and performance will be much worse with chert.
>
> Profiling the code with callgrind revealed that around 85% of the CPU
> time is being spent in get_doclength() calls, and around 90% of that time
> is spent in ChertPostList::move_forward_in_chunk_to_at_least(). This
> method calls next_in_chunk() repeatedly to find the appropriate doclen:
> on average, it calls next_in_chunk() around 30 times per call.
>
> I don't think this degree of slowdown is acceptable, so we need to either
> find a way to make the code faster with the existing datastructure, or
> find a way to allow faster seeking in the doclen list.
>
> I've done some experiments about this, which are detailed in the
> comments. So far, a combination of avoid_string_operations.patch and
> optimise_unpack_uint.patch give the best result, but still result in
> searches 4.5 times slower than with flint. I think a redesign of the way
> doclen information is stored is needed to get acceptable speeds.
New description:
I've built a benchmark database of slightly over 100,000 documents from
wikipedia, and indexed these with flint and chert. When searching the
resulting databases, with 10,000 single term searches, with the database
fully cached, the flint database completes all searches in 1.78 seconds
(averaging 0.000178 seconds/query), whereas the chert database completes
all searches in 12.58 seconds (averaging 0.001258 seconds/query) - ie,
chert is about 7 times slower than flint.
Note that the chert database is considerably smaller than the flint
database, which hopefully means that in the uncached case chert might
perform better. However, with databases under 1m documents, we're likely
to be IO bound, and performance will be much worse with chert.
Profiling the code with callgrind revealed that around 85% of the CPU time
is being spent in get_doclength() calls, and around 90% of that time is
spent in ChertPostList::move_forward_in_chunk_to_at_least(). This method
calls next_in_chunk() repeatedly to find the appropriate doclen: on
average, it calls next_in_chunk() around 30 times per call.
I don't think this degree of slowdown is acceptable, so we need to either
find a way to make the code faster with the existing datastructure, or
find a way to allow faster seeking in the doclen list.
I've done some experiments about this, which are detailed in the comments.
So far, a combination of avoid_string_operations.patch and
optimise_unpack_uint.patch give the best result, but still result in
searches 4.5 times slower than with flint. I think a redesign of the way
doclen information is stored is needed to get acceptable speeds.
--
--
Ticket URL: <http://trac.xapian.org/ticket/326#comment:20>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list