[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Olly Betts olly at survex.com
Sun Aug 25 02:11:50 BST 2013

On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> I think norm(t, d) in Lucene can be used to calculate a number which is
> similar to doc length (see norm(t,d) in
> http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).

It sounds similar (especially if document and field boosts aren't in use),
though some places may rely on the doc_length = sum(wdf) definition - in
particular, some other measure of length may violate assumptions like
wdf <= doc_length.
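
As a minimal sketch of what that definition implies (the TermFreqs type and
both function names below are hypothetical, made up for illustration, and are
not part of Xapian's API): the length is the sum of the per-term wdf values,
so no single wdf can exceed it, whereas a length taken from some other source
can break that.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory view of one document: term -> wdf
// (within-document frequency).  Not a Xapian type.
using TermFreqs = std::map<std::string, std::uint32_t>;

// Document length under the doc_length = sum(wdf) definition.
inline std::uint64_t doc_length(const TermFreqs& doc) {
    std::uint64_t total = 0;
    for (const auto& entry : doc) {
        total += entry.second;
    }
    return total;
}

// Returns true if wdf <= reported_length holds for every term.
// With reported_length = doc_length(doc) this is always true; a length
// derived from another measure has no such guarantee.
inline bool length_is_consistent(const TermFreqs& doc,
                                 std::uint64_t reported_length) {
    for (const auto& entry : doc) {
        if (entry.second > reported_length) {
            return false;
        }
    }
    return true;
}
```

For a document with wdf 3 for one term and 5 for another, doc_length() gives
8 and the check passes, but a substitute length of 4 would fail it.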

For now, using weighting schemes which don't use document length is
probably the simplest answer.

> This feature is implemented in this pull request (
> https://github.com/xapian/xapian/pull/25). Here is some information about the
> new features and a performance test:

You've made great progress!  I've started to look through the pull
request and made some comments via github.

> This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
> and is not fully tested. I am sending this patch to find out whether it works for the
> idea http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> So far only a few features are supported:
> 1. Single term search.
> 2. 'AND' search is supported, but its performance needs to be optimized.
> 3. Multiple segments.
> 4. Doc length, using .nrm instead.
> Additionally:
> 1. xxx_lower_bound, xxx_upper_bound and total doc length are not supported.
> This data does not exist in the Lucene backend, so I've used constants instead,
> and the search results may not be good.

You should simply not define these methods for your backend - Xapian has
fall-back versions (used for inmemory) which will then be used.  If you
return some constant which isn't actually a valid bound, the matcher
will make invalid assumptions while optimising, resulting in incorrect
search results.
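
To illustrate the failure mode with a toy model (this is not Xapian's matcher;
Posting, PostList, claimed_max_weight and best_doc are all hypothetical names
invented for this sketch): a matcher that skips any posting list whose claimed
maximum weight can't beat the best candidate so far will return the wrong
document as soon as a claimed maximum is lower than a real weight.

```cpp
#include <vector>

// Toy types for illustration only; none of these exist in Xapian.
struct Posting {
    unsigned docid;
    double weight;
};

struct PostList {
    std::vector<Posting> postings;
    double claimed_max_weight;  // upper bound reported by the backend
};

// Returns the docid with the highest weight, or 0 if nothing matched.
// A list is skipped entirely when its claimed bound says it cannot
// beat the best weight seen so far.
inline unsigned best_doc(const std::vector<PostList>& lists) {
    unsigned best_id = 0;
    double best_weight = -1.0;
    for (const PostList& list : lists) {
        if (list.claimed_max_weight <= best_weight) {
            continue;  // pruned on the strength of the claimed bound
        }
        for (const Posting& p : list.postings) {
            if (p.weight > best_weight) {
                best_weight = p.weight;
                best_id = p.docid;
            }
        }
    }
    return best_id;
}
```

If the second list really contains a posting with weight 5.0 but claims a
maximum of 1.0, it is pruned after the first list yields weight 2.0, and the
lower-weighted document is returned.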

> 2. Compound files are not supported, so compound files must be disabled when
> indexing.
> I've built a performance test of 1,000,000 documents from wiki (actually, I
> downloaded a single file from wiki containing 1,000,000 lines and treated
> each line as a document). For single term searches, the
> performance of the Lucene backend is as fast as Xapian's Chert.
> Test environment: OS: Ubuntu in a virtual machine, CPU: 1 core, MEM: 800M.
> There are 242 terms; I ran a single term search per term and calculated the
> total time for these 242 searches (results fluctuate, so I give 10 results per
> backend):
> 1. backend Lucene
> 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,
> 1551ms
> 2. backend Chert
> 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,
> 1809ms

So this benchmark is pretty much meaningless because of the incorrect
constant bounds in use.


