[Xapian-discuss] Two questions

Olly Betts olly at survex.com
Wed May 4 00:18:49 BST 2005


On Wed, May 04, 2005 at 12:19:45AM +0200, roki roki wrote:
> I want to mention that I have compiled Xapian with BM25Weight() : k1(10000),
> k2(0), k3(0), b(0), min_normlen(0.5) because I need to get documents with
> most frequencies at the top.
> 
> I also use only add_term with large frequencies (based on "pagerank" and
> html formatting)  and for adding/replacing documents I use the following
> procedure:
> 
> $code= unique id from my mysql database
> 
> $doc = Search::Xapian::Document->new(); 
>  
> $doc->set_data("$code"); 
> 
> $doc->add_term("blablabla", 500);
> 
> $database->replace_document($code,$doc);

With those BM25 parameters, BM25Weight will calculate the maximum value
a term can add to a document's weight as 10001 * T (where T depends on
the number of documents in the index and the term frequency), but each
term will actually add only wdf * 10001 * T / (10000 + wdf).  So the
ratio actual:max is wdf / (10000 + wdf) : 1, and when wdf = 500 that's
about 0.0476:1!
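
As a quick check of that arithmetic, here's a minimal standalone C++
sketch.  It doesn't use the Xapian API at all, and the function name
bm25_fraction_of_max is made up for illustration.  It assumes b = 0 (so
length normalisation drops out), and the common factor T cancels:

```cpp
#include <cassert>

// Fraction of the per-term upper bound that a term with the given wdf
// actually contributes, assuming b = 0 as in the parameters quoted above.
// Both quantities are expressed in units of T, so T cancels out.
double bm25_fraction_of_max(double wdf, double k1) {
    assert(k1 > 0 && wdf >= 0);
    double actual = wdf * (k1 + 1) / (k1 + wdf);  // actual contribution / T
    double bound = k1 + 1;                        // upper bound / T
    return actual / bound;                        // equals wdf / (k1 + wdf)
}
```

Calling bm25_fraction_of_max(500, 10000) gives 500/10500, which is
roughly 0.0476.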

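If wdf = 500 is typical, this means some of the matcher's optimisations
will never get a chance to operate, since they rely on the maximum being
a reasonably tight bound on what a term can actually contribute.  Combine
that with collapsing on a value, and this probably explains the slow
searches.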

If you really just want to weight linearly on wdf, you could implement
your own weighting scheme which returns min(wdf, CEILING) as the
sumpart and CEILING as the maxpart (for some suitable constant CEILING).
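
To illustrate, here's a rough sketch of just those two calculations as
plain C++ functions.  This is not a Xapian::Weight subclass: the
function names and the value of CEILING are placeholders I've picked for
the example, and you'd need to wire the same logic into the sumpart and
maxpart methods of your own weighting class.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical cap; choose a value to suit the wdfs you actually index.
const double CEILING = 1000.0;

// Per-term contribution: linear in wdf, but never more than CEILING.
double capped_sumpart(double wdf) {
    assert(wdf >= 0);
    return std::min(wdf, CEILING);
}

// Upper bound on any term's contribution, which the matcher can rely on.
double capped_maxpart() {
    return CEILING;
}
```

With CEILING at 1000, a term with wdf = 500 contributes 500 out of a
possible 1000, so the bound is far tighter than with the BM25 parameters
above.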

> > "By 60%" would be quite an extreme reduction (even by 40% is rather more
> > than I've generally seen), but lots of replacement can leave blocks less
> > full.  If you can provide a program and datafiles I can use to reproduce
> > this, I'll take a look and see if this can be improved.

> I can create a Xapian database with a few hundred documents and send it to
> you if you want.

It would be much more useful to have something to generate such a
database (e.g. a perl script and any datafiles it might need).  Then I
can try it on modified versions of the quartz backend to see if changes
to how deletes are handled help.

Cheers,
    Olly


