[Xapian-devel] Implementing tf-idf weighting scheme in Xapian

aarsh shah aarshkshah1992 at gmail.com
Tue Feb 19 17:51:14 GMT 2013


Hello guys. I just read up on tf-idf schemes and want to implement one in
Xapian (with some frequently used normalizations), as it will also give me a
good grasp of implementing a weighting scheme before I start working on
the DFR schemes.

I read the following references and I think I've understood them well
enough to write the code:

1.)
http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
2.) http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html
3.) http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The basic idea is that rare terms (terms which occur in only a few
documents) should contribute a higher weight to the documents they index
than terms which occur in many documents. Also, the higher a term's
within-document frequency (wdf) in a document, the more weight the term
gives to that document.

The basic formula is W(t,d) = wdf * log(N / termfreq).
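As a rough illustration of the basic formula, here is a minimal Python sketch. It does not use the Xapian API; the function name tfidf_weight and its parameters are made up for this example, and returning 0 for a term that is absent from the document is my own assumption.

```python
import math


def tfidf_weight(wdf, termfreq, collection_size):
    """Unnormalized tf-idf weight: wdf * log(N / termfreq).

    wdf             -- within-document frequency of the term
    termfreq        -- number of documents indexed by the term
    collection_size -- N, the number of documents in the collection
    """
    # Assumption: a term that does not occur contributes no weight.
    if wdf <= 0 or termfreq <= 0:
        return 0.0
    return wdf * math.log(collection_size / termfreq)
```

With the same wdf, a term indexing 10 of 1000 documents gets a higher weight than one indexing 500 of them, and a term that indexes every document gets a weight of 0.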

However, various normalizations can be applied to both wdf and idf.

The extra per-document component will be 0 here, so get_maxextra() will
return 0.

Moreover, an upper bound on W(t,d) for get_maxpart() can be found easily
for a particular normalization (provided all the required statistics are
available).

For example, if I am using logarithmic normalization for the wdf
(within-document frequency), then an upper bound on W(t,d) will be
(log(wdf_upperbound_) + 1) * log(N / termfreq), since the normalized wdf
increases with wdf, while N (the collection size) and termfreq (the number
of documents indexed by the term t) remain constant for a given term t.
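To check the reasoning behind this bound, here is a small Python sketch. It is independent of Xapian; all function names are hypothetical, and it assumes termfreq <= N so that the idf factor is never negative.

```python
import math


def log_normalized_wdf(wdf):
    """Logarithmic wdf normalization: 1 + log(wdf), or 0 if the term is absent."""
    return 1.0 + math.log(wdf) if wdf > 0 else 0.0


def idf(termfreq, collection_size):
    """Plain idf: log(N / termfreq). Non-negative as long as termfreq <= N."""
    return math.log(collection_size / termfreq)


def weight(wdf, termfreq, collection_size):
    """W(t,d) with logarithmic wdf normalization."""
    return log_normalized_wdf(wdf) * idf(termfreq, collection_size)


def max_weight(wdf_upper_bound, termfreq, collection_size):
    """Upper bound on W(t,d): evaluate the weight at the largest possible wdf.

    This is valid because log_normalized_wdf is non-decreasing in wdf and
    the idf factor is a non-negative constant for a given term.
    """
    return weight(wdf_upper_bound, termfreq, collection_size)
```

For any wdf between 1 and the upper bound, weight(wdf, ...) never exceeds max_weight(...), which is the property an upper bound for get_maxpart() needs.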

However, some normalizations for the wdf use the formula wdfn = wdf /
max(wdf,d), where max(wdf,d) is the maximum within-document frequency of
any term in the document d. This statistic is not provided by the
need_stat() function of the Xapian::Weight class, so I don't know how to
obtain it. Could someone please help me with that?
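To make clear which statistic I mean, here is a minimal Python sketch of this normalization. It assumes the wdf of every term in the document is already available as a dict, which is exactly the part I don't know how to get inside Xapian; the function name and the sample data are made up.

```python
def max_normalized_wdfs(doc_wdfs):
    """Apply wdfn = wdf / max(wdf,d) to every term of one document.

    doc_wdfs -- dict mapping each term in the document to its wdf
    """
    # max(wdf,d): the largest wdf of any term in this document.
    max_wdf = max(doc_wdfs.values(), default=0)
    if max_wdf == 0:
        # Assumption: an empty document yields no normalized values.
        return {}
    return {term: wdf / max_wdf for term, wdf in doc_wdfs.items()}


# Hypothetical document with three terms.
example = max_normalized_wdfs({"xapian": 4, "weight": 2, "scheme": 1})
```

The most frequent term in the document always gets wdfn = 1, and every other term is scaled relative to it.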

I will work on implementing weight normalization (like cosine
normalization) once I am done implementing the scheme with the various wdf
and idf normalizations.

Please let me know what you all think; I want to start working on this.
I'm also sorry for being late with modifying the stemmer patch based on
the feedback; I have tests going on at university.

-Regards
-Aarsh