<br><br><div class="gmail_quote">On Sat, Jan 19, 2013 at 7:53 AM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On Thu, Jan 17, 2013 at 01:50:25PM +0800, ?????? wrote:<br>
> I am suffering from slow search performance with Xapian.<br>
><br>
> I am using Xapian for indexing about 150,000,000 documents.<br>
> It was implemented in C++;<br>
<br>
</div>Which version of Xapian are you using?<br>
<br>
What OS is this on?<br>
<br>
How big is the database on disk?<br>
<br>
How much RAM do you have?<br></blockquote><div><br></div><div>I am using xapian-core 1.2.12 on Debian.</div><div>The database is about 50 GB on disk.</div><div>The machine has 48 GB of RAM.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> Search performance is not that fast.<br>
> e.g. a query of about 20 terms takes 2 seconds on average.<br>
><br>
> For searching, I followed such steps:<br>
><br>
</div>> 1. construct a QueryParser for certain string<br>
> 2. parse the query to get a Xapian::Query<br>
> 3. construct an Enquire for searching by calling get_mset method<br>
<br>
How large an MSet are you requesting? Asking for a larger MSet can<br>
greatly increase the work that needs to be done.<br>
<br>
Are you setting any non-default options for the search?<br>
<br>
An actual code sample showing what you're profiling would be much<br>
clearer than a text description...<br>
<div class="im"><br>
> here is the function-time-cost for searching:<br>
<br>
</div>Have you checked how much time is spent waiting for I/O?<br>
<br>
Assuming you're on a Unix-like platform, the simplest way is just to run<br>
the example with "time" - e.g.<br>
<br>
$ time ./test-script<br>
<br>
real 0m27.863s<br>
user 0m4.514s<br>
sys 0m0.236s<br>
<br>
So here there's about 23 seconds unaccounted for, which is likely to be<br>
waiting for I/O unless the system is otherwise busy. The user vs sys<br>
split is also interesting.<br>
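<br>
The same wall-clock versus CPU comparison can be made from inside the program.<br>
The following is a minimal sketch, not part of the original code: the names<br>
Timing, measure and simulated_io_wait are hypothetical, and it assumes a POSIX<br>
system, where std::clock() reports CPU time used by the process rather than<br>
elapsed time. A large gap between the two figures suggests time spent waiting,<br>
for example on I/O.<br>

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Wall-clock and CPU seconds consumed while running a piece of work.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Run `work` once and report elapsed wall-clock time and process CPU time.
// Assumption: std::clock() measures CPU time (true on POSIX systems).
template <typename Work>
Timing measure(Work work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Hypothetical stand-in for a search that blocks on I/O: it only sleeps,
// so it uses wall-clock time but almost no CPU time.
inline void simulated_io_wait() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
```

Calling measure(simulated_io_wait) should report roughly 0.2 seconds of wall-clock time and close to zero CPU time. Wrapping the real search call in measure instead would show whether it is CPU-bound or mostly waiting.<br>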
<div class="im"><br></div></blockquote><div><br></div><div>I timed 500 queries; the average cost per query was:</div><div>real 1876ms </div><div>user 1649ms </div><div>sys 227ms</div><div><br></div><div>
below is the actual code:</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><font color="#ff0000"><i><b>search</b></i></font></blockquote>
<div><div><div>std::vector<uint32_t> search(const string& query1, const string& query2, const string& query3, unsigned offset, unsigned pagesize, double k1, double k2, double k3, double b, double min_normlen) {</div>
<div> std::vector<uint32_t> result;</div><div><br></div><div> try {</div><div> Xapian::Query kw_query, unkw_query, cq;</div><div><div> Xapian::QueryParser query_parser;</div><div> cq = query_parser.parse_query(query1);</div>
<div> kw_query = query_parser.parse_query(query2);</div><div> unkw_query = query_parser.parse_query(query3);</div></div><div> Xapian::Query final_query = Xapian::Query(Xapian::Query::op::OP_OR,Xapian::Query(Xapian::Query::op::OP_SCALE_WEIGHT,kw_query, 8),unkw_query);</div>
<div> final_query = Xapian::Query(Xapian::Query::op::OP_AND_MAYBE,Xapian::Query(Xapian::Query::op::OP_SCALE_WEIGHT,cq, 10),final_query);</div><div><br></div><div>//db_->r is a Xapian::Database </div><div> Xapian::Enquire enquire(*(db_->r));</div>
<div> enquire.set_weighting_scheme(Xapian::BM25Weight(k1, k2, k3, b, min_normlen));</div><div> enquire.set_query(final_query);</div><div><br></div><div> Xapian::MSet mset = enquire.get_mset(offset, pagesize);</div>
<div><br></div><div> for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {</div><div> const string& data = i.get_document().get_data();</div><div> uint32_t reid = atoi(data.c_str());</div>
<div> result.push_back(reid);</div><div> }</div><div> } catch (Xapian::Error& e) {</div><div> LLOG_ERROR("Xapian error: type = %s, msg = %s", e.get_type(), e.get_msg().c_str());</div>
<div> }</div><div><br></div><div> return result;</div><div>}</div></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<i><b><font color="#ff0000">index</font></b></i></blockquote></div><div><div>void index(const string& document, const string& id) {</div><div><br></div><div> Xapian::TermGenerator tg;</div><div> tg.set_stemmer(Xapian::Stem("none"));</div>
<div><br></div><div> Xapian::Document doc;</div><div> tg.set_document(doc);</div><div><br></div><div> tg.index_text(document);</div><div><br></div><div> // Store all the fields for display purposes.</div><div>
doc.set_data(id);</div><div><br></div><div> string idterm = "Q" + id;</div><div> doc.add_boolean_term(idterm);</div><div>//db_->w is a Xapian::WritableDatabase</div><div> db_->w->replace_document(idterm, doc);</div>
<div><br></div><div>}</div></div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
> samples % symbol name<br>
> 75649 28.0401 ChertPostList::move_forward_in_chunk_to_at_least(unsigned<br>
> int)<br>
> 30118 11.1635 Xapian::BM25Weight::get_sumpart(unsigned int, unsigned<br>
> int) const<br>
> 21291 7.8917 AndMaybePostList::process_next_or_skip_to(double,<br>
> Xapian::PostingIterator::Internal*)<br>
> 17803 6.5989 OrPostList::next(double)<br>
</div>[...]<br>
<div class="im">><br>
> most of the time was spent in chert post list operations;<br>
<br>
</div>That breakdown is not a total surprise - you're (presumably) using the<br>
chert backend, and post lists are the lists of documents for each term.<br>
<br>
But taking 2 seconds on average is significantly slower than I'd expect,<br>
even for 20 term queries.<br>
<div class="im"><br>
> Could I use separate databases to get faster searching?<br>
<br>
</div>If you're only trying to search a subset of the documents for a lot of<br>
the searches, then splitting those out can help a lot.<br>
<br>
> Will compacting the database help?<br>
<br>
Yes, if it's not already compact, that will probably help.<br>
<div class="im"><br>
> How can I reduce the time spent in chert post list operations?<br>
<br>
</div>If you're using 1.2.x (or anything older), make sure you're running the<br>
latest 1.2.x release (1.2.13 currently), as various optimisation tweaks<br>
get added with time.<br>
<br>
You could also try trunk, which has a few extra optimisations.<br>
<br>
Setting (and exporting) XAPIAN_PREFER_BRASS in the environment before<br>
you *create* the database will use the brass backend, which has a few<br>
changes over chert, but probably not anything very significant for<br>
this case.<br>
<br>
Another trick is to set b=0 in BM25Weight, which avoids having to read<br>
the document length information to calculate the weights. If your<br>
documents are all a similar size anyway, the length normalisation<br>
won't make much difference - e.g. you can do this like so (the 4th<br>
parameter is b):<br>
<br>
enquire.set_weighting_scheme(Xapian::BM25Weight(1, 0, 1, 0, 0.5));<br>
<br>
<a href="http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#40ccbf44b5aabe58c1258eb4abcd4df1" target="_blank">http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#40ccbf44b5aabe58c1258eb4abcd4df1</a><br>
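<br>
To see why b=0 removes the dependence on document length, here is a small<br>
sketch of the textbook BM25 term-frequency component. This is an illustration<br>
only: the function name bm25_tf_component is hypothetical, and the formula is<br>
the commonly published form, which is assumed here and is not Xapian's exact<br>
implementation.<br>

```cpp
// Textbook BM25 term-frequency component (NOT Xapian's exact code):
//   tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
// With b = 0 the length normalisation factor is exactly 1, so doc_len and
// avg_len no longer influence the result.
inline double bm25_tf_component(double tf, double doc_len, double avg_len,
                                double k1, double b) {
    const double length_norm = 1.0 - b + b * (doc_len / avg_len);
    return tf * (k1 + 1.0) / (tf + k1 * length_norm);
}
```

With tf = 3, k1 = 1 and b = 0 this gives 1.5 for any document length, whereas with b = 0.5 a document much longer than average scores lower than a short one for the same tf.<br>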
<br>
Cheers,<br>
Olly<br>
</blockquote></div><div><br></div>I tried setting b = 0, but it didn't help much.<div><br></div><div><br></div><div>Best,</div><div><br></div><div>De Lin<br><br>
</div>