[Xapian-devel] Implementing tf-idf weighting scheme in Xapian

Olly Betts olly at survex.com
Tue Feb 19 22:28:29 GMT 2013


On Tue, Feb 19, 2013 at 11:21:14PM +0530, aarsh shah wrote:
> The basic philosophy is that rare terms (terms which occur in only a few
> documents) should give a higher weight to the documents they index than
> terms which occur in many documents. Also, the higher the within-document
> frequency of a term in a document, the more weight the term gives to that
> document.
> 
> The basic formula is W(t,d) = wdf * log(N/termfreq).
> 
> However, various normalizations can be applied to both wdf and idf.

Both the original probabilistic formula and BM25 actually fit in this
pattern too (aside from the per-document component in BM25).
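
As a rough illustration of the basic formula quoted above, a minimal C++
sketch might look like the following. It is standalone code, not part of
the Xapian API; the function name tfidf_weight and its parameter names are
hypothetical, chosen to mirror the symbols in the formula.

```cpp
#include <cmath>

// Hypothetical helper (not a Xapian function): basic tf-idf term weight.
//   wdf      - within-document frequency of the term in the document
//   N        - number of documents in the collection
//   termfreq - number of documents indexed by the term
double tfidf_weight(double wdf, double N, double termfreq) {
    // Guard against degenerate statistics rather than dividing by zero.
    if (N <= 0.0 || termfreq <= 0.0) return 0.0;
    // W(t,d) = wdf * log(N / termfreq)
    return wdf * std::log(N / termfreq);
}
```

With the same wdf, a term indexing 10 of 100 documents gets a higher weight
from this function than one indexing 50 of 100, which is the "rare terms
count for more" behaviour described above.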

> The extra per-document component will be 0 here, so get_maxextra() will
> return 0.

Indeed.

> Moreover, an upper bound on W(t,d) for get_maxpart() can easily be found
> for a particular normalization (if I have all the required metrics
> available).
> 
> For example, if I am using logarithmic normalization for the wdf
> (within-document frequency), then an upper bound on W(t,d) will be
> (log(wdf_upperbound_)+1)*log(N/termfreq), as N (collection size) and
> termfreq (number of documents indexed by the term t) remain constant
> for a given term t.

Yes.
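
To make the bound concrete, here is a small standalone C++ sketch of the
logarithmic normalization and its upper bound. It does not use the Xapian
API; the function names are hypothetical, and it assumes the normalized wdf
is 1 + log(wdf) for wdf > 0 (and 0 otherwise) and that termfreq <= N, so
the idf factor is never negative.

```cpp
#include <cmath>

// Hypothetical helpers (not Xapian functions).

// Logarithmic wdf normalization: 1 + log(wdf) for wdf > 0, else 0.
double log_normalised_wdf(double wdf) {
    return wdf > 0.0 ? 1.0 + std::log(wdf) : 0.0;
}

// idf = log(N / termfreq); constant for a given term.
double idf_factor(double N, double termfreq) {
    return std::log(N / termfreq);
}

// Weight contributed by a term with the given wdf in one document.
double log_tf_weight(double wdf, double N, double termfreq) {
    return log_normalised_wdf(wdf) * idf_factor(N, termfreq);
}

// Upper bound over all documents: the normalized wdf is non-decreasing in
// wdf, so plugging in an upper bound on wdf bounds the weight.
double log_tf_max_weight(double wdf_upper_bound, double N, double termfreq) {
    return log_normalised_wdf(wdf_upper_bound) * idf_factor(N, termfreq);
}
```

Because N and termfreq are fixed for a term, the only per-document input is
wdf, which is why an upper bound on wdf is enough to bound the weight.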

> However, some normalizations for the wdf use the formula wdfn = wdf /
> max(wdf,d), where max(wdf,d) is the maximum within-document frequency of
> any term in the document. This metric is not provided by the need_stat()
> function of the Xapian::Weight class, so I don't know how to obtain it.
> Could someone please help me with that?

We don't currently store that, and you can't efficiently calculate it on
the fly, so you'd have to alter the backends to store this statistic.
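
For illustration only, the statistic in question could be sketched in
standalone C++ as below. This is not Xapian code: the TermWdfs type and both
function names are hypothetical, and the map simply stands in for "every
term in one document with its wdf". It shows that the maximum depends on all
terms of the document, not just the term being weighted.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Hypothetical stand-in for one document's term list: term -> wdf.
using TermWdfs = std::map<std::string, unsigned>;

// Largest wdf of any term in the document; needs a pass over every term.
unsigned max_wdf_in_document(const TermWdfs& doc) {
    unsigned best = 0;
    for (const auto& entry : doc) best = std::max(best, entry.second);
    return best;
}

// Max-normalized wdf: wdfn = wdf / max(wdf,d); 0 for an empty document.
double max_normalised_wdf(unsigned wdf, unsigned max_wdf) {
    return max_wdf == 0 ? 0.0 : static_cast<double>(wdf) / max_wdf;
}
```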

I would suggest you look at the weighting schemes which don't need new
stats first, and then look at ones which do once you're more familiar
with implementing weighting schemes.

Cheers,
    Olly


