<div dir="ltr"><br><div><u><i>For now, using weighting schemes which don't use document length is<br>
probably the simplest answer.</i></u><br>
<div><br>There's a tf-idf weighting scheme on SVN master - is it suitable for the Lucene backend?<br></div><div>
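For context, a tf-idf score needs no document-length term at all, which is why it fits a backend whose lengths aren't reliable yet. A minimal sketch, assuming the classic log-based tf and idf variants (illustrative only - this is not Xapian's actual TfIdfWeight API, and the function name is made up):

```python
import math

def tf_idf_score(wdf, term_docfreq, total_docs):
    """Classic tf-idf: within-document frequency (wdf) damped by log,
    times inverse document frequency.  Note there is no document-length
    factor anywhere in the formula."""
    tf = 1 + math.log(wdf) if wdf > 0 else 0.0
    idf = math.log(total_docs / term_docfreq)
    return tf * idf

# A term appearing twice in a document, present in 10 of 1000 documents:
score = tf_idf_score(2, 10, 1000)
```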
<br>
</div><u><i>You've made great progress! I've started to look through the pull<br>
request and made some comments via github.<br><br></i></u></div><div>Thanks for your comments; I will update the code as soon as possible.<u><i><br><br></i></u></div><div>Regards<u><i><br></i></u></div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">2013/8/25 Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:<br>
> I think norm(t, d) in Lucene can be used to calculate a number which is<br>
> similar to doc length (see norm(t, d) in<br>
> <a href="http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm" target="_blank">http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm</a>).<br>
<br>
</div>It sounds similar (especially if document and field boosts aren't in use),<br>
though some places may rely on the doc_length = sum(wdf) definition - in<br>
particular, some other measure of length may violate assumptions like<br>
wdf <= doc_length.<br>
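The distinction can be sketched concretely. Xapian's doc_length = sum(wdf) makes wdf <= doc_length hold by construction, whereas a norm-derived length is decoded from a lossily stored value (Lucene 3.x crushes 1/sqrt(numTerms) into a single byte) and can undershoot the true length. The power-of-two quantizer below is a hypothetical stand-in for that lossy encoding, not Lucene's actual SmallFloat format:

```python
import math

def xapian_doc_length(wdf_by_term):
    # Xapian's definition: doc length is the sum of wdfs, so
    # wdf <= doc_length holds for every term by construction.
    return sum(wdf_by_term.values())

def approx_length_from_norm(true_length):
    # Stand-in for a norm-derived length: the stored norm is lossy, so
    # the decoded length can undershoot the truth.  Here the loss is
    # modelled by rounding down to a power of two.
    return 2 ** int(math.log2(true_length))

doc = {"the": 9, "cat": 2, "sat": 1}          # true length 12
exact = xapian_doc_length(doc)
approx = approx_length_from_norm(exact)        # 8 under this model
assert all(wdf <= exact for wdf in doc.values())
assert not all(wdf <= approx for wdf in doc.values())  # invariant broken
```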
<br>
For now, using weighting schemes which don't use document length is<br>
probably the simplest answer.<br>
<div class="im"><br>
> And this feature is applied in this pull request (<br>
> <a href="https://github.com/xapian/xapian/pull/25" target="_blank">https://github.com/xapian/xapian/pull/25</a>). Here's some information about<br>
> the new features and performance tests:<br>
<br>
</div>You've made great progress! I've started to look through the pull<br>
request and made some comments via github.<br>
<div class="im"><br>
> This is a patch for a Lucene 3.6.2 backend; it only supports Lucene 3.6.2<br>
> and is not fully tested. I'm sending this patch wondering whether it works<br>
> for the idea <a href="http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes" target="_blank">http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes</a>.<br>
> So far, only a few features are supported, including:<br>
> 1. Single term search.<br>
> 2. 'AND' search supported, but the performance needs to be optimized.<br>
> 3. Multiple segments.<br>
> 4. Doc length (using .nrm instead).<br>
><br>
> Additionally:<br>
> 1. xxx_lower_bound, xxx_upper_bound, and total doc length are not supported.<br>
> This data doesn't exist in the Lucene backend, so I've used constants<br>
> instead, which means the search results may not be good.<br>
<br>
</div>You should simply not define these methods for your backend - Xapian has<br>
fall-back versions (used for inmemory) which will then be used. If you<br>
return some constant which isn't actually a valid bound, the matcher<br>
will make invalid assumptions while optimising, resulting in incorrect<br>
search results.<br>
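A toy simulation (not Xapian's real matcher, and the function is made up) shows the failure mode: the matcher uses an upper bound on attainable weight to stop scoring postings that provably cannot beat the current best, so a claimed bound that is lower than weights actually occurring makes it skip valid documents:

```python
def top1(weights, claimed_upper_bound):
    """Toy matcher: stop scoring once the claimed upper bound says no
    remaining posting can beat the best weight found so far."""
    best = None
    for docid, weight in weights:
        if best is not None and claimed_upper_bound <= best[1]:
            break  # prune: "no remaining doc can do better"
        if best is None or weight > best[1]:
            best = (docid, weight)
    return best

postings = [(1, 0.4), (2, 0.9), (3, 1.7)]
true_bound = 1.7
bogus_bound = 0.5   # an invalid constant bound, as in the patch

assert top1(postings, true_bound) == (3, 1.7)   # correct result
assert top1(postings, bogus_bound) == (2, 0.9)  # wrong: doc 3 never scored
```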
<div class="im"><br>
> 2. Compound files are not supported, so the compound file format must be<br>
> disabled when indexing.<br>
><br>
> I've built a performance test with 1,000,000 documents (actually, I<br>
> downloaded a single file from Wikipedia containing 1,000,000 lines and<br>
> treated each line as a document). When doing single term searches, the<br>
> performance of the Lucene backend is as fast as Xapian's chert backend.<br>
> Test environment - OS: Ubuntu in a virtual machine; CPU: 1 core; MEM: 800M.<br>
> For 242 terms, doing one single term search per term, I calculate the<br>
> total time used for these 242 searches (results fluctuate, so I give 10<br>
> results per backend):<br>
> 1. backend Lucene<br>
> 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,<br>
> 1551ms<br>
> 2. backend Chert<br>
> 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,<br>
> 1809ms<br>
<br>
</div>So this benchmark is pretty much meaningless because of the incorrect<br>
constant bounds in use.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br></div>