[Xapian-devel] FASTER Search
林德
leedeetiger at gmail.com
Sun Jan 20 05:19:33 GMT 2013
On Sat, Jan 19, 2013 at 7:53 AM, Olly Betts <olly at survex.com> wrote:
> On Thu, Jan 17, 2013 at 01:50:25PM +0800, 林德 wrote:
> > I am suffering from slow search performance with Xapian.
> >
> > I am using Xapian for indexing about 150,000,000 documents.
> > It was implemented in C++;
>
> Which version of Xapian are you using?
>
> What OS is this on?
>
> How big is the database on disk?
>
> How much RAM do you have?
>
I am using xapian-core-1.2.12 on Debian.
The database is about 50 GB on disk.
The machine has 48 GB of RAM.
>
> > Search performance is not very fast.
> > e.g. a query with about 20 terms takes 2 seconds on average.
> >
> > For searching, I follow these steps:
> >
> > 1. construct a QueryParser for certain string
> > 2. parse the query to get a Xapian::Query
> > 3. construct an Enquire and run the search by calling its get_mset method
>
> How large an MSet are you requesting? Asking for a larger MSet can
> greatly increase the work that needs to be done.
>
> Are you setting any non-default options for the search?
>
> An actual code sample showing what you're profiling would be much
> clearer than a text description...
>
> > here is the function-time-cost for searching:
>
> Have you checked how much time is spent waiting for I/O?
>
> Assuming you're on a Unix-like platform, the simplest way is just to run
> the example with "time" - e.g.
>
> $ time ./test-script
>
> real 0m27.863s
> user 0m4.514s
> sys 0m0.236s
>
> So here there's about 23 seconds unaccounted for, which is likely to be
> waiting for I/O unless the system is otherwise busy. The user vs sys
> split is also interesting.
>
>
I timed a run of 500 queries; the average times per query were:
real 1876ms
user 1649ms
sys 227ms
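Applying the subtraction from the example quoted above to these averages gives a rough idea of how much time is spent waiting rather than computing. A minimal sketch in C++ (the helper name unaccounted_ms is made up for illustration, and all times are taken as whole milliseconds):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: wall-clock time that is not explained by CPU time.
// A large value hints at I/O wait (or an otherwise busy system); a value
// near zero hints that the process is CPU-bound.
inline std::int64_t unaccounted_ms(std::int64_t real_ms,
                                   std::int64_t user_ms,
                                   std::int64_t sys_ms) {
    return real_ms - (user_ms + sys_ms);
}
```

With the figures from the quoted example (27863, 4514, 236) this gives 23113 ms, which matches the "about 23 seconds". With the averages here (1876, 1649, 227) it gives 0, which would suggest the searches are CPU-bound rather than waiting on I/O, assuming the measurements are representative.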
Below is the actual code:
> *search*
std::vector<uint32_t> search(const string& query1, const string& query2,
                             const string& query3, unsigned offset,
                             unsigned pagesize, double k1, double k2,
                             double k3, double b, double min_normlen) {
    std::vector<uint32_t> result;
    try {
        Xapian::QueryParser query_parser;
        Xapian::Query cq = query_parser.parse_query(query1);
        Xapian::Query kw_query = query_parser.parse_query(query2);
        Xapian::Query unkw_query = query_parser.parse_query(query3);
        // (kw_query with weight scaled by 8) OR unkw_query
        Xapian::Query final_query(
            Xapian::Query::OP_OR,
            Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, kw_query, 8),
            unkw_query);
        // (cq with weight scaled by 10) AND_MAYBE the OR query above
        final_query = Xapian::Query(
            Xapian::Query::OP_AND_MAYBE,
            Xapian::Query(Xapian::Query::OP_SCALE_WEIGHT, cq, 10),
            final_query);
        // db_->r is a Xapian::Database
        Xapian::Enquire enquire(*(db_->r));
        enquire.set_weighting_scheme(
            Xapian::BM25Weight(k1, k2, k3, b, min_normlen));
        enquire.set_query(final_query);
        Xapian::MSet mset = enquire.get_mset(offset, pagesize);
        for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
            // The document data holds the numeric id stored by index().
            const string& data = i.get_document().get_data();
            uint32_t reid = atoi(data.c_str());
            result.push_back(reid);
        }
    } catch (const Xapian::Error& e) {
        LLOG_ERROR("Xapian error: type = %s, msg = %s", e.get_type(),
                   e.get_msg().c_str());
    }
    return result;
}
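To see which stage of search() dominates (query parsing, the match in get_mset, or fetching document data), each stage could be timed separately. A minimal sketch, assuming a C++11 compiler; the helper name elapsed_ms is hypothetical and not part of Xapian:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run f() once and return the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, wrapping only the get_mset call in a lambda passed to elapsed_ms would time the match on its own, separately from query parsing and from the loop that reads document data.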
*index*
void index(const string& document, const string& id) {
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("none"));
    Xapian::Document doc;
    tg.set_document(doc);
    tg.index_text(document);
    // Store the id as the document data so search() can read it back.
    doc.set_data(id);
    // Unique id term, used to replace any existing document with this id.
    string idterm = "Q" + id;
    doc.add_boolean_term(idterm);
    // db_->w is a Xapian::WritableDatabase
    db_->w->replace_document(idterm, doc);
}
> > samples  %        symbol name
> > 75649    28.0401  ChertPostList::move_forward_in_chunk_to_at_least(unsigned int)
> > 30118    11.1635  Xapian::BM25Weight::get_sumpart(unsigned int, unsigned int) const
> > 21291     7.8917  AndMaybePostList::process_next_or_skip_to(double, Xapian::PostingIterator::Internal*)
> > 17803     6.5989  OrPostList::next(double)
> [...]
> >
> > Most of the time is spent in chert post list operations.
>
> That breakdown is not a total surprise - you're (presumably) using the
> chert backend, and post lists are the lists of documents for each term.
>
> But taking 2 seconds on average is significantly slower than I'd expect,
> even for 20 term queries.
>
> > Could I split the data into separate databases to get faster searching?
>
> If you're only trying to search a subset of the documents for a lot of
> the searches, then splitting those out can help a lot.
>
> > Will compacting the database help?
>
> Yes, if it's not already compact, that will probably help.
>
> > How can I reduce the time spent on chert post list operations?
>
> If you're using 1.2.x (or anything older), make sure you're running the
> latest 1.2.x release (1.2.13 currently), as various optimisation tweaks
> get added with time.
>
> You could also try trunk, which has a few extra optimisations.
>
> Setting (and exporting) XAPIAN_PREFER_BRASS in the environment before
> you *create* the database will use the brass backend, which has a few
> changes over chert, but probably not anything very significant for
> this case.
>
> Another trick is to set b=0 in BM25Weight, which avoids having to read
> the document length information to calculate the weights. If your
> documents are all a similar size anyway, the length normalisation
> won't make much difference - e.g. you can do this like so (the 4th
> parameter is b):
>
> enquire.set_weighting_scheme(Xapian::BM25Weight(1, 0, 1, 0, 0.5));
>
>
> http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#40ccbf44b5aabe58c1258eb4abcd4df1
>
> Cheers,
> Olly
>
I tried setting b = 0, but it didn't help much.
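For context, here is a rough sketch of why b = 0 removes the dependence on document length. It uses the textbook BM25 term-frequency component, not Xapian's actual implementation; the function name bm25_tf_component and its parameter names are made up for illustration:

```cpp
#include <cassert>

// Textbook BM25 term-frequency component (an assumption for illustration,
// not Xapian's exact code):
//   (k1 + 1) * wdf / (k1 * ((1 - b) + b * normlen) + wdf)
// wdf is the within-document frequency of the term and normlen is the
// document length divided by the average document length.
inline double bm25_tf_component(double wdf, double normlen,
                                double k1, double b) {
    const double length_factor = (1.0 - b) + b * normlen;
    return (k1 + 1.0) * wdf / (k1 * length_factor + wdf);
}
```

With b = 0 the length factor is always 1, so a short document and a long document with the same wdf get the same value and the document length never needs to be read. With a non-zero b (say 0.75) the shorter document scores higher.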
Best,
De Lin