[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Tue Aug 20 12:28:42 BST 2013

Hi guys,

I think norm(t, d) in Lucene can be used to calculate a number which is
similar to the doc length (see norm(t,d) in
http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).

This is implemented in this pull request
(https://github.com/xapian/xapian/pull/25). Here is some information about
the new features and a performance test:

This is a patch adding a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
and is not fully tested; I am sending it because I am wondering whether it
works for the idea at
http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
So far only a few features are supported:
1. Single term search.
2. 'AND' search is supported, but its performance needs to be optimized.
3. Multiple segments.
4. Doc length, using the .nrm file instead.
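
As a rough illustration of item 4, here is a minimal sketch of how a doc
length could be approximated from one .nrm byte. It assumes the index was
written with Lucene 3.x DefaultSimilarity (norm = boost / sqrt(numTerms)),
that no boosts were used, and that the byte uses the SmallFloat "315"
encoding. The class and the approxDocLength helper are hypothetical names for
this example and are not taken from the patch. The encoding keeps only 3
mantissa bits, so the result is only an estimate.

```java
// Sketch only: approximate a document length from a Lucene 3.x norm byte,
// assuming DefaultSimilarity's length norm (1/sqrt(numTerms)) and no boosts.
final class NormDocLengthSketch {
    // Same bit layout as Lucene's SmallFloat.byte315ToFloat:
    // 3 mantissa bits, 5 exponent bits, zero exponent point of 15.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Counterpart of SmallFloat.floatToByte315, included only so that the
    // sketch can be exercised without a real index.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Hypothetical helper: invert norm = 1/sqrt(len) to get len = 1/norm^2.
    // A zero norm carries no usable length, so 0 is returned for it.
    static double approxDocLength(byte normByte) {
        float norm = byte315ToFloat(normByte);
        if (norm == 0.0f) return 0.0;
        return 1.0 / ((double) norm * norm);
    }

    public static void main(String[] args) {
        // Round-trip a few true lengths to see how lossy the estimate is.
        for (int len : new int[] {1, 10, 100, 1000}) {
            byte b = floatToByte315((float) (1.0 / Math.sqrt(len)));
            System.out.println(len + " -> " + approxDocLength(b));
        }
    }
}
```

Because the norm is stored per field, a backend following this approach would
have to pick one field's norms (or combine several) to stand in for the
Xapian doc length.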

Additionally:
1. xxx_lower_bound, xxx_upper_bound and total doc length are not supported.
This data does not exist in a Lucene index, so I've used constants instead,
and the search results may not be good.
2. Compound files are not supported, so compound files must be disabled when
indexing.

I've built a performance test with 1,000,000 documents (actually, I
downloaded a single file from wiki which contains 1,000,000 lines, and
treated each line as a document). For single term searches, the Lucene
backend is as fast as Xapian's Chert.
Test environment: OS: Ubuntu virtual machine, CPU: 1 core, MEM: 800M.
242 terms, one single term search per term, measuring the total time used
for these 242 searches (the results fluctuate, so I give 10 results per
backend):
1. backend Lucene
1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,
1551ms
2. backend Chert
1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,
1809ms
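
To make the two lists easier to compare, here is a small sketch that averages
the ten totals listed above for each backend. The class and method names are
made up for this example; only the numbers come from the runs above.

```java
// Sketch only: average the ten total times (in ms) listed for each backend.
final class TimingSummarySketch {
    static final int[] LUCENE =
        {1540, 1587, 1516, 1706, 1690, 1597, 1376, 1570, 1218, 1551};
    static final int[] CHERT =
        {1286, 1626, 1575, 1771, 1661, 1662, 1808, 1341, 1688, 1809};

    // Arithmetic mean of the total times.
    static double mean(int[] ms) {
        long sum = 0;
        for (int v : ms) sum += v;
        return (double) sum / ms.length;
    }

    public static void main(String[] args) {
        System.out.printf("Lucene mean: %.1f ms%n", mean(LUCENE));
        System.out.printf("Chert mean:  %.1f ms%n", mean(CHERT));
    }
}
```

The means come out at about 1535 ms for Lucene and 1623 ms for Chert, a gap
that is smaller than the spread between runs of the same backend.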

The code for testing is in quest.cc; you can look at this file for details.

The code for Lucene indexing looks like this (Xapian indexing used
examples/simpleindex.cc):

    // 'directory' (a Lucene Directory) and 'br' (a BufferedReader over the
    // input file) are set up elsewhere.
    IndexWriter indexWriter = new IndexWriter(directory,
            new EnglishAnalyzer(Version.LUCENE_36),
            IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.setUseCompoundFile(false); // compound files must be disabled
    int lineId = 0;
    String origLine;
    // Read lines from the input file; each line becomes one document.
    while ((origLine = br.readLine()) != null) {
        lineId++;
        origLine = origLine.trim();

        Document doc = new Document();
        doc.add(new Field("data", origLine, Field.Store.YES,
                Field.Index.ANALYZED));
        doc.add(new Field("dataorigin", origLine, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("lid", String.valueOf(lineId), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close(); // flush pending documents and commit the index

2013/6/17 Richard Boulton <richard at tartarus.org>

> Ah, a quick follow-on from that: read
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html
>
> There's a per-document "norm" which can be stored, which BM25Similarity
> uses to store the document length.  Additional factors can be stored in
> DocValuesFields (which are very similar to document values in Xapian, in
> that they're stored in separate sequences, though are a bit more flexible).
>
>
> On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:
>
>> You might want to look at how Lucene has implemented document length
>> lookup for the BM25Similarity class (added in Lucene 4.0):
>>
>>
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>>
>> I assumed they're using a document payload for storing the lengths, but
>> haven't looked into it.
>>
>>
>> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>>
>>> *Or do you mean that it's one number per document whereas the other
>>> stats are per database, so it's harder to store it?*
>>>
>>> Yes, that is what I mean. It's a huge amount of data. If I add a new
>>> doclength list myself (containing all the doclengths in one list, like
>>> chert), I am concerned that:
>>> 1. This doclength list may be the bottleneck in this backend, see
>>> http://trac.xapian.org/ticket/326
>>> 2. It changes too much on top of the Lucene file format, which makes it
>>> hard to compare performance between Xapian and Lucene.
>>>
>>> Some ideas:
>>> 1. Use a ranking algorithm without doclength, such as BM25Weight or
>>> TradWeight without doclength, or tfidfWeight.
>>>     Will the ranking results be poor without doclength?
>>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>>     https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>>     http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>>     But this method has an obvious drawback: it does not work for
>>>     general Lucene index data. If doclength was not stored at indexing
>>>     time, this method doesn't work.
>>>
>>>>
>>>> Any suggestions?
>>>
>>> Regards