[Xapian-devel] FASTER Search

Olly Betts olly at survex.com
Fri Jan 18 23:53:53 GMT 2013


On Thu, Jan 17, 2013 at 01:50:25PM +0800, ?????? wrote:
> I am suffering for slow searching performance on Xapian.
> 
> I am using Xapian for indexing about 150,000,000 documents.
> It was implemented in C++;

Which version of Xapian are you using?

What OS is this on?

How big is the database on disk?

How much RAM do you have?

> The performance of searching was not that fast.
> e.g. Searching a query, which includes about 20 terms, needs 2 secs avg.
> 
> For searching, I followed such steps:
> 
>    1. construct a QueryParser for certain string
>    2. parse the query to get a Xapian::Query
>    3. construct an Enquire for searching by calling get_mset method

How large an MSet are you requesting?  Asking for a larger MSet can
greatly increase the work that needs to be done.

Are you setting any non-default options for the search?

An actual code sample showing what you're profiling would be much
clearer than a text description...

> here is the function-time-cost for searching:

Have you checked how much time is spent waiting for I/O?

Assuming you're on a Unix-like platform, the simplest way is just to run
the example with "time" - e.g.

$ time ./test-script

real    0m27.863s
user    0m4.514s
sys     0m0.236s

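So here there's about 23 seconds unaccounted for (real minus user minus
sys), which is likely to be time spent waiting for I/O unless the system
is otherwise busy.  The user vs sys split is also interesting: user is
CPU time spent in the process's own code, while sys is CPU time spent
in the kernel on its behalf.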

> samples  %        symbol name
> 75649    28.0401  ChertPostList::move_forward_in_chunk_to_at_least(unsigned int)
> 30118    11.1635  Xapian::BM25Weight::get_sumpart(unsigned int, unsigned int) const
> 21291     7.8917  AndMaybePostList::process_next_or_skip_to(double, Xapian::PostingIterator::Internal*)
> 17803     6.5989  OrPostList::next(double)
[...]
> 
> most of the time cost were about chert post list;

That breakdown is not a total surprise - you're (presumably) using the
chert backend, and post lists are the lists of documents for each term.

But taking 2 seconds on average is significantly slower than I'd expect,
even for 20 term queries.

> Could I use some separate database for getting faster searching?

If you're only trying to search a subset of the documents for a lot of
the searches, then splitting those out can help a lot.

> Compacting database will help?

Yes, if it's not already compact, that will probably help.

> How to reduce time cost for chert post list operation?

If you're using 1.2.x (or anything older), make sure you're running the
latest 1.2.x release (1.2.13 currently), as various optimisation tweaks
get added with time.

You could also try trunk, which has a few extra optimisations.

Setting (and exporting) XAPIAN_PREFER_BRASS in the environment before
you *create* the database will use the brass backend, which has a few
changes over chert, but probably not anything very significant for
this case.

Another trick is to set b=0 in BM25Weight, which avoids having to read
the document length information to calculate the weights.  If your
documents are all a similar size anyway, the length normalisation
won't make much difference.  You can do this like so (the 4th
parameter is b):

    enquire.set_weight(Xapian::BM25Weight(1, 0, 1, 0, 0.5));

http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#40ccbf44b5aabe58c1258eb4abcd4df1

Cheers,
    Olly