<div dir="ltr"><br><div><u><i>For now, using weighting schemes which don't use document length is<br>
probably the simplest answer.</i></u><br>
<div><br>There's a tf-idf weighting scheme on SVN master - is it suitable for the Lucene backend?<br></div><div>
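For context, a tf-idf score needs no document-length term at all, which is why it fits a backend whose lengths aren't reliable yet. A minimal sketch, assuming the classic log-based tf and idf variants (illustrative only - this is not Xapian's actual TfIdfWeight API, and the function name is made up):

```python
import math

def tf_idf_score(wdf, term_docfreq, total_docs):
    """Classic tf-idf: within-document frequency (wdf) damped by log,
    times inverse document frequency.  Note there is no document-length
    factor anywhere in the formula."""
    tf = 1 + math.log(wdf) if wdf > 0 else 0.0
    idf = math.log(total_docs / term_docfreq)
    return tf * idf

# A term appearing twice in a document, present in 10 of 1000 documents:
score = tf_idf_score(2, 10, 1000)
```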
<br>
</div><u><i>You've made great progress! I've started to look through the pull<br>
request and made some comments via github.<br><br></i></u></div><div>Thanks for your comments; I will update the code as soon as possible.<u><i><br><br></i></u></div><div>Regards<u><i><br></i></u></div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">2013/8/25 Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:<br>
> I think norm(t, d) in Lucene can be used to calculate a number which is<br>
> similar to doc length (see norm(t, d) in<br>
> <a href="http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm" target="_blank">http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm</a>).<br>
<br>
</div>It sounds similar (especially if document and field boosts aren't in use),<br>
though some places may rely on the doc_length = sum(wdf) definition - in<br>
particular, some other measure of length may violate assumptions like<br>
wdf <= doc_length.<br>
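The distinction can be sketched concretely. Xapian's doc_length = sum(wdf) makes wdf <= doc_length hold by construction, whereas a norm-derived length is decoded from a lossily stored value (Lucene 3.x crushes 1/sqrt(numTerms) into a single byte) and can undershoot the true length. The power-of-two quantizer below is a hypothetical stand-in for that lossy encoding, not Lucene's actual SmallFloat format:

```python
import math

def xapian_doc_length(wdf_by_term):
    # Xapian's definition: doc length is the sum of wdfs, so
    # wdf <= doc_length holds for every term by construction.
    return sum(wdf_by_term.values())

def approx_length_from_norm(true_length):
    # Stand-in for a norm-derived length: the stored norm is lossy, so
    # the decoded length can undershoot the truth.  Here the loss is
    # modelled by rounding down to a power of two.
    return 2 ** int(math.log2(true_length))

doc = {"the": 9, "cat": 2, "sat": 1}          # true length 12
exact = xapian_doc_length(doc)
approx = approx_length_from_norm(exact)        # 8 under this model
assert all(wdf <= exact for wdf in doc.values())
assert not all(wdf <= approx for wdf in doc.values())  # invariant broken
```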
<br>
For now, using weighting schemes which don't use document length is<br>
probably the simplest answer.<br>
<div class="im"><br>
> And this feature is applied in this pull request (<br>
> <a href="https://github.com/xapian/xapian/pull/25" target="_blank">https://github.com/xapian/xapian/pull/25</a>). Here's some information about<br>
> the new features and performance tests:<br>
<br>
</div>You've made great progress! I've started to look through the pull<br>
request and made some comments via github.<br>
<div class="im"><br>
> This is a patch for a Lucene 3.6.2 backend; it only supports Lucene 3.6.2<br>
> and is not fully tested. I'm sending this patch wondering whether it works<br>
> for the idea <a href="http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes" target="_blank">http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes</a>.<br>
> So far, only a few features are supported, including:<br>
> 1. Single term search.<br>
> 2. 'AND' search supported, but the performance needs to be optimized.<br>
> 3. Multiple segments.<br>
> 4. Doc length (using .nrm instead).<br>
><br>
> Additionally:<br>
> 1. xxx_lower_bound, xxx_upper_bound, and total doc length are not supported.<br>
> This data doesn't exist in the Lucene backend, so I've used constants<br>
> instead, which means the search results may not be good.<br>
<br>
</div>You should simply not define these methods for your backend - Xapian has<br>
fall-back versions (used for inmemory) which will then be used. If you<br>
return some constant which isn't actually a valid bound, the matcher<br>
will make invalid assumptions while optimising, resulting in incorrect<br>
search results.<br>
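A toy simulation (not Xapian's real matcher, and the function is made up) shows the failure mode: the matcher uses an upper bound on attainable weight to stop scoring postings that provably cannot beat the current best, so a claimed bound that is lower than weights actually occurring makes it skip valid documents:

```python
def top1(weights, claimed_upper_bound):
    """Toy matcher: stop scoring once the claimed upper bound says no
    remaining posting can beat the best weight found so far."""
    best = None
    for docid, weight in weights:
        if best is not None and claimed_upper_bound <= best[1]:
            break  # prune: "no remaining doc can do better"
        if best is None or weight > best[1]:
            best = (docid, weight)
    return best

postings = [(1, 0.4), (2, 0.9), (3, 1.7)]
true_bound = 1.7
bogus_bound = 0.5   # an invalid constant bound, as in the patch

assert top1(postings, true_bound) == (3, 1.7)   # correct result
assert top1(postings, bogus_bound) == (2, 0.9)  # wrong: doc 3 never scored
```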
<div class="im"><br>
> 2. Compound files are not supported, so the compound file format must be<br>
> disabled when indexing.<br>
><br>
> I've built a performance test with 1,000,000 documents (actually, I<br>
> downloaded a single file from Wikipedia containing 1,000,000 lines and<br>
> treated each line as a document). When doing single term searches, the<br>
> performance of the Lucene backend is as fast as Xapian's chert backend.<br>
> Test environment - OS: Ubuntu in a virtual machine; CPU: 1 core; MEM: 800M.<br>
> For 242 terms, doing one single term search per term, I calculate the<br>
> total time used for these 242 searches (results fluctuate, so I give 10<br>
> results per backend):<br>
> 1. backend Lucene<br>
> 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,<br>
> 1551ms<br>
> 2. backend Chert<br>
> 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,<br>
> 1809ms<br>
<br>
</div>So this benchmark is pretty much meaningless because of the incorrect<br>
constant bounds in use.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br></div>