[Xapian-devel] Backend for Lucene format indexes-How to get doclength
jiangwen jiang
jiangwen127 at gmail.com
Mon Aug 26 02:41:07 BST 2013
*For now, using weighting schemes which don't use document length is
probably the simplest answer.*
There's a tf-idf weighting scheme on svn master. Is it suitable for the
Lucene backend?
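
To make the question concrete, here is a minimal sketch of a generic tf-idf
score that never reads the document length. It is not the code on svn master;
the function name and the exact formula (wdf * log(N / termfreq)) are
assumptions made for illustration only.

```python
import math

def tfidf_score(query_terms, doc_wdf, termfreq, num_docs):
    """Generic tf-idf: sum of wdf * log(N / termfreq) over matching terms.

    Hypothetical helper for illustration; it uses only per-term statistics
    (wdf, termfreq) and the collection size, never the document length.
    """
    score = 0.0
    for term in query_terms:
        wdf = doc_wdf.get(term, 0)   # within-document frequency
        df = termfreq.get(term, 0)   # number of documents containing the term
        if wdf == 0 or df == 0:
            continue
        score += wdf * math.log(num_docs / df)
    return score

# Toy collection: docid -> {term: wdf}
docs = {
    1: {"lucene": 3, "index": 1},
    2: {"xapian": 2, "index": 4},
}
termfreq = {"lucene": 1, "xapian": 1, "index": 2}

scores = {
    docid: tfidf_score(["lucene", "index"], wdf, termfreq, len(docs))
    for docid, wdf in docs.items()
}
```

Since only wdf, termfreq and the number of documents are needed, a scheme of
this shape would not depend on how the backend approximates doc length.
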
*You've made great progress! I've started to look through the pull
request and made some comments via github.*

Thanks for your comments, I will update the code as soon as possible.

Regards

2013/8/25 Olly Betts <olly at survex.com>
> On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> > I think norm(t, d) in Lucene can be used to calculate a number which is
> > similar to doc length (see norm(t,d) in
> > http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm
> > ).
>
> It sounds similar (especially if document and field boosts aren't in use),
> though some places may rely on the doc_length = sum(wdf) definition - in
> particular, some other measure of length may violate assumptions like
> wdf <= doc_length.
>
> For now, using weighting schemes which don't use document length is
> probably the simplest answer.
>
> > And this feature is included in this pull request
> > (https://github.com/xapian/xapian/pull/25). Here's the information about
> > the new features and the performance test:
>
> You've made great progress! I've started to look through the pull
> request and made some comments via github.
>
> > This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
> > and is not fully tested. I'm sending this patch to find out whether it
> > works for the idea at
> > http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> > So far only a few features are supported, including:
> > 1. Single term search.
> > 2. 'AND' search is supported, but performance needs to be optimized.
> > 3. Multiple segments.
> > 4. Doc length. Using .nrm instead.
> >
> > Additionally:
> > 1. xxx_lower_bound, xxx_upper_bound, total doc length are not supported.
> > This data does not exist in the Lucene backend, so I've used constants
> > instead, and the search results may not be good.
>
> You should simply not define these methods for your backend - Xapian has
> fall-back versions (used for inmemory) which will then be used. If you
> return some constant which isn't actually a valid bound, the matcher
> will make invalid assumptions while optimising, resulting in incorrect
> search results.
>
> > 2. Compound files are not supported, so compound files must be disabled
> > when indexing.
> >
> > I've built a performance test of 1,000,000 documents from wiki (actually,
> > I downloaded a single file from wiki which contains 1,000,000 lines, and
> > I treat each line as a document). When doing single term search, the
> > performance of the Lucene backend is as fast as Xapian Chert.
> > Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
> > 242 terms, doing one single term search per term, and calculating the
> > total time used for these 242 searches (results fluctuate, so I give 10
> > results per backend):
> > 1. backend Lucene
> > 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,
> > 1551ms
> > 2. backend Chert
> > 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,
> > 1809ms
>
> So this benchmark is pretty much meaningless because of the incorrect
> constant bounds in use.
>
> Cheers,
> Olly
>
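
A rough sketch of the point above about wdf <= doc_length, to check my
understanding. It assumes a single field, the default length norm
boost / sqrt(num_terms) from the Similarity formula linked above, and it
ignores the lossy encoding used when the norm is stored in .nrm. All function
names are hypothetical and are not part of Xapian or Lucene.

```python
import math

def lucene_style_norm(num_terms, boost=1.0):
    # Assumption: norm = boost / sqrt(num_terms), without the on-disk encoding.
    return boost / math.sqrt(num_terms)

def length_from_norm(norm):
    # Invert the assumed formula: num_terms = (boost / norm)^2.
    # The boost is not recoverable from the norm alone, so it is treated as 1.
    return 1.0 / (norm * norm)

def invariant_holds(wdfs, doc_length):
    # The assumption mentioned above: every wdf is at most the document length.
    return all(wdf <= doc_length for wdf in wdfs.values())

wdfs = {"lucene": 6, "backend": 2, "index": 2}
true_length = sum(wdfs.values())  # doc_length = sum(wdf) = 10

# No boost: the estimate is about 10, so the invariant holds.
unboosted = length_from_norm(lucene_style_norm(true_length))

# Boost of 2: the estimate shrinks to about 2.5, below wdf("lucene") = 6.
boosted = length_from_norm(lucene_style_norm(true_length, boost=2.0))

ok_without_boost = invariant_holds(wdfs, unboosted)
ok_with_boost = invariant_holds(wdfs, boosted)
```

Under these assumptions the norm-derived length is only safe when no document
or field boosts are in use, which matches the caveat above.
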