[Xapian-devel] Implementing tf-idf weighting scheme in Xapian

aarsh shah aarshkshah1992 at gmail.com
Wed Feb 20 05:09:48 GMT 2013


TF-IDF also has many normalizations that work with the statistics we
currently provide. I'll send in a patch for a new TFIDF weight class
implementing all the normalizations I can with the current statistics.
Once it is up and running, I'll work on rewriting the backend for
additional statistics, as you said. My final aim is to provide all the
normalizations mentioned here:

1.)
http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html

-Regards
-Aarsh

On Tue, Feb 19, 2013 at 11:21 PM, aarsh shah <aarshkshah1992 at gmail.com> wrote:

> Hello guys. I just read up on tf-idf schemes and want to implement them in
> Xapian (with some frequently used normalizations), as it will also give me
> a good feel for implementing a weighting scheme before I start working on
> the DFR schemes.
>
> I read the following references, and I think I've understood them well
> and can write the hack:
>
> 1.)
> http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
> 2.) http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html
> 3.) http://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
> The basic philosophy is that rare terms (terms which occur in only a few
> documents) should give a higher weight to the documents they index than
> terms which occur in many documents. Also, the higher the within-document
> frequency of a term in a document, the more weight the term gives to that
> document.
>
> The basic formula is W(t,d) = wdf * log(N/termfreq).
>
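As a rough illustration of the basic formula quoted above, here is a minimal standalone sketch in Python. It does not use the Xapian API; the function name `tfidf_weight` and its parameter names are assumptions made for this example only.

```python
import math

def tfidf_weight(wdf, termfreq, collection_size):
    """Basic tf-idf weight: W(t,d) = wdf * log(N / termfreq).

    wdf             -- within-document frequency of term t in document d
    termfreq        -- number of documents indexed by term t
    collection_size -- N, the number of documents in the collection
    """
    if wdf <= 0 or termfreq <= 0:
        # A term that does not occur contributes nothing.
        return 0.0
    return wdf * math.log(collection_size / termfreq)

# Same wdf, but one term is rare (10 of 1000 documents) and the other is
# common (500 of 1000 documents).
rare = tfidf_weight(3, 10, 1000)
common = tfidf_weight(3, 500, 1000)
```

With equal wdf, the rare term ends up with the larger weight, which matches the philosophy described in the message.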
> However, various normalizations can be applied to both wdf and idf.
>
> The extra per-document component will be 0 here, so get_maxextra() will
> return 0.
>
> Moreover, an upper bound on W(t,d) for get_maxpart() can easily be found
> for a particular normalization (if I have all the required statistics
> available).
>
> For example, if I am using logarithmic normalization for the wdf
> (within-document frequency), then an upper bound on W(t,d) will be
> (log(wdf_upperbound_)+1)*log(N/termfreq), as N (collection size) and
> termfreq (number of documents indexed by the term t) remain constant
> for a given term t.
>
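The upper-bound argument quoted above can be sketched as follows. This is a standalone Python illustration under stated assumptions, not Xapian code: the helper names (`log_wdf`, `idf`, `weight`, `max_weight`) are hypothetical, and `max_weight` only plays the role that get_maxpart() would play in a real weighting class.

```python
import math

def log_wdf(wdf):
    """Logarithmic wdf normalization: 1 + log(wdf), or 0 if the term is absent."""
    return 1.0 + math.log(wdf) if wdf > 0 else 0.0

def idf(termfreq, collection_size):
    """Inverse document frequency: log(N / termfreq)."""
    return math.log(collection_size / termfreq)

def weight(wdf, termfreq, collection_size):
    """Per-document weight for one term with log-normalized wdf."""
    return log_wdf(wdf) * idf(termfreq, collection_size)

def max_weight(wdf_upper_bound, termfreq, collection_size):
    """Upper bound on weight() over all documents.

    N and termfreq are fixed for a given term, and log_wdf() never
    decreases as wdf grows, so plugging in the largest possible wdf
    gives the largest possible weight (assuming termfreq <= N, so that
    idf is not negative).
    """
    return log_wdf(wdf_upper_bound) * idf(termfreq, collection_size)
```

For any wdf between 1 and the wdf upper bound, `weight(...)` stays at or below `max_weight(...)` for the same term.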
> However, some normalizations for the wdf use the formula
> wdfn = wdf / max(wdf,d), where max(wdf,d) is the maximum within-document
> frequency of any term in the document d. This statistic is not provided by
> the need_stat() function of the Xapian::Weight class, so I don't know how
> to obtain it. Can someone please help me with that?
>
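To make the max-wdf normalization above concrete, here is a small standalone Python sketch. It assumes the wdf of every term in the document is already at hand as a plain dictionary, which is exactly the statistic the message says is not available through need_stat(); the function name and the sample data are made up for illustration.

```python
def max_normalized_wdfs(term_wdfs):
    """Return wdfn = wdf / max(wdf, d) for every term in one document.

    term_wdfs maps each term in document d to its within-document frequency.
    """
    if not term_wdfs:
        return {}
    max_wdf = max(term_wdfs.values())
    if max_wdf == 0:
        # Avoid dividing by zero for a document with no term occurrences.
        return {term: 0.0 for term in term_wdfs}
    return {term: wdf / max_wdf for term, wdf in term_wdfs.items()}

# Hypothetical document: the most frequent term normalizes to 1.0.
doc = {"xapian": 4, "weight": 2, "scheme": 1}
normalized = max_normalized_wdfs(doc)
```

The normalization needs a per-document statistic (the largest wdf in d), which is why it cannot be computed from term-level and collection-level statistics alone.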
> I will work on implementing weight normalization (like cosine
> normalization) once I am done implementing the scheme with the various wdf
> and idf normalizations.
>
> Please let me know what you all think; I want to start working. And I'm
> sorry for being late with modifying the stemmer patch based on the
> feedback; I have tests going on at university.
>
> -Regards
> -Aarsh
>
>