[Xapian-devel] FASTER Search

林德 leedeetiger at gmail.com
Sun Jan 20 05:19:33 GMT 2013


On Sat, Jan 19, 2013 at 7:53 AM, Olly Betts <olly at survex.com> wrote:

> On Thu, Jan 17, 2013 at 01:50:25PM +0800, 林德 wrote:
> > I am suffering from slow search performance with Xapian.
> >
> > I am using Xapian to index about 150,000,000 documents.
> > It is implemented in C++.
>
> Which version of Xapian are you using?
>
> What OS is this on?
>
> How big is the database on disk?
>
> How much RAM do you have?
>

I am using xapian-core 1.2.12 on Debian.
The database is about 50 GB on disk.
The machine has 48 GB of RAM.


>
> > Search performance is not that fast.
> > For example, a query with about 20 terms takes about 2 seconds on average.
> >
> > For searching, I follow these steps:
> >
> >    1. construct a QueryParser for the query string
> >    2. parse the query to get a Xapian::Query
> >    3. construct an Enquire and run the search by calling its get_mset method
>
> How large an MSet are you requesting?  Asking for a larger MSet can
> greatly increase the work that needs to be done.
>
> Are you setting any non-default options for the search?
>
> An actual code sample showing what you're profiling would be much
> clearer than a text description...
>
> > here is the function-time-cost for searching:
>
> Have you checked how much time is spent waiting for I/O?
>
> Assuming you're on a Unix-like platform, the simplest way is just to run
> the example with "time" - e.g.
>
> $ time ./test-script
>
> real    0m27.863s
> user    0m4.514s
> sys     0m0.236s
>
> So here there's about 23 seconds unaccounted for, which is likely to be
> waiting for I/O unless the system is otherwise busy.  The user vs sys
> split is also interesting.
>
>
I timed a test run of 500 queries; the average cost per query was:
real  1876ms
user  1649ms
sys    227ms
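
The subtraction described in the quoted advice above (real minus user minus sys) can be applied to these averages as well. A minimal sketch follows; the Timing struct and unaccounted_ms function are hypothetical names made up for illustration, and the values are simply the `time` figures converted to milliseconds:

```cpp
#include <cassert>

// Hypothetical helper type for illustration: one set of `time` figures,
// converted to milliseconds.
struct Timing {
    long real_ms;  // wall-clock time
    long user_ms;  // CPU time spent in user space
    long sys_ms;   // CPU time spent in the kernel
};

// Time not accounted for by CPU work, i.e. a rough upper bound on the time
// spent waiting (I/O, scheduling) rather than computing.
long unaccounted_ms(const Timing& t) {
    assert(t.real_ms >= 0 && t.user_ms >= 0 && t.sys_ms >= 0);
    return t.real_ms - t.user_ms - t.sys_ms;
}
```

For the quoted example (27863, 4514, 236) this gives 23113 ms, matching the "about 23 seconds" mentioned there. For the averages above (1876, 1649, 227) it gives 0 ms, which would suggest these searches are CPU-bound rather than waiting on I/O, assuming the search runs in a single thread.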

below is the actual code:

*search*

std::vector<uint32_t> search(const string& query1, const string& query2,
                             const string& query3, unsigned offset,
                             unsigned pagesize, double k1, double k2,
                             double k3, double b, double min_normlen) {
    std::vector<uint32_t> result;

    try {
        Xapian::Query kw_query, unkw_query, cq;
        Xapian::QueryParser query_parser;
        cq = query_parser.parse_query(query1);
        kw_query = query_parser.parse_query(query2);
        unkw_query = query_parser.parse_query(query3);
        // Scale kw_query's weights by 8 and OR the result with unkw_query.
        Xapian::Query final_query(
            Xapian::Query::OP_OR,
            Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, kw_query, 8),
            unkw_query);
        // cq (scaled by 10) must match; the OR query only adds weight.
        final_query = Xapian::Query(
            Xapian::Query::OP_AND_MAYBE,
            Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, cq, 10),
            final_query);

        // db_->r is a Xapian::Database
        Xapian::Enquire enquire(*(db_->r));
        enquire.set_weighting_scheme(
            Xapian::BM25Weight(k1, k2, k3, b, min_normlen));
        enquire.set_query(final_query);

        Xapian::MSet mset = enquire.get_mset(offset, pagesize);

        for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
            const string& data = i.get_document().get_data();
            uint32_t reid = atoi(data.c_str());
            result.push_back(reid);
        }
    } catch (const Xapian::Error& e) {
        LLOG_ERROR("Xapian error: type = %s, msg = %s", e.get_type(),
                   e.get_msg().c_str());
    }

    return result;
}

*index*

void index(const string& document, const string& id) {

    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("none"));

    Xapian::Document doc;
    tg.set_document(doc);

    tg.index_text(document);

    // Store the ID as the document data; search() reads it back from there.
    doc.set_data(id);

    string idterm = "Q" + id;
    doc.add_boolean_term(idterm);
    // db_->w is a Xapian::WritableDatabase
    db_->w->replace_document(idterm, doc);

}




> > samples  %        symbol name
> > 75649    28.0401  ChertPostList::move_forward_in_chunk_to_at_least(unsigned int)
> > 30118    11.1635  Xapian::BM25Weight::get_sumpart(unsigned int, unsigned int) const
> > 21291     7.8917  AndMaybePostList::process_next_or_skip_to(double, Xapian::PostingIterator::Internal*)
> > 17803     6.5989  OrPostList::next(double)
> [...]
> >
> > Most of the time is spent in the chert post list code.
>
> That breakdown is not a total surprise - you're (presumably) using the
> chert backend, and post lists are the lists of documents for each term.
>
> But taking 2 seconds on average is significantly slower than I'd expect,
> even for 20 term queries.
>
> > Could I use separate databases to get faster searching?
>
> If you're only trying to search a subset of the documents for a lot of
> the searches, then splitting those out can help a lot.
>
> > Will compacting the database help?
>
> Yes, if it's not already compact, that will probably help.
>
> > How can I reduce the time spent in chert post list operations?
>
> If you're using 1.2.x (or anything older), make sure you're running the
> latest 1.2.x release (1.2.13 currently), as various optimisation tweaks
> get added with time.
>
> You could also try trunk, which has a few extra optimisations.
>
> Setting (and exporting) XAPIAN_PREFER_BRASS in the environment before
> you *create* the database will use the brass backend, which has a few
> changes over chert, but probably not anything very significant for
> this case.
>
> Another trick is to set b=0 in BM25Weight, which avoids having to read
> the document length information to calculate the weights.  If your
> documents are all a similar size anyway, the length normalisation
> won't make much difference - e.g. you can do this like so (the 4th
> parameter is b):
>
>     enquire.set_weighting_scheme(Xapian::BM25Weight(1, 0, 1, 0, 0.5));
>
>
> http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#40ccbf44b5aabe58c1258eb4abcd4df1
>
> Cheers,
>     Olly
>

I tried setting b = 0, but it didn't help much.
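
To illustrate why b = 0 removes the dependence on document length, here is a minimal sketch of the textbook BM25 term-frequency component. It is not Xapian's actual implementation (Xapian's BM25Weight also applies k2, k3 and min_normlen, which are left out here), and the function name bm25_tf is made up for illustration:

```cpp
#include <cassert>

// Textbook BM25 term-frequency component (a simplification, not Xapian's code).
//   wdf        - within-document frequency of the term
//   doclen     - length of this document
//   avg_doclen - average document length in the collection
double bm25_tf(double wdf, double doclen, double avg_doclen,
               double k1, double b) {
    assert(avg_doclen > 0.0);
    const double normlen = doclen / avg_doclen;
    // With b == 0 the bracket reduces to 1, so normlen (and therefore the
    // document length) drops out of the result entirely.
    const double denom = wdf + k1 * ((1.0 - b) + b * normlen);
    return wdf * (k1 + 1.0) / denom;
}
```

With b = 0, two documents of very different lengths get the same value for the same wdf, so the document length never needs to be looked up; with b = 0.5 the shorter document scores higher.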


Best,

De Lin