Hello guys. I just read up on tf-idf schemes and want to implement one in Xapian (with some frequently used normalizations), as it will also give me a good feel for implementing a weighting scheme before I start working on the DFR schemes.<br>
<br>I read the following references, and I think I've understood them well enough to write the hack:<br><br>1.) <a href="http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html">http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html</a><br>
2.) <a href="http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html">http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html</a><br>3.) <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">http://en.wikipedia.org/wiki/Tf%E2%80%93idf</a><br>
<br>The basic philosophy is that rare terms (terms which occur in few documents) should give a higher weight to the documents they index than terms which occur in many documents. Also, the higher the within-document frequency of a term in a document, the more weight the term gives to that document.<br>
<br>The basic formula is W(t,d) = wdf * log(N/termfreq), where log(N/termfreq) is the idf.<br><br>However, various normalizations can be applied to both wdf and idf.<br><br>The extra per-document component will be 0 here, so get_maxextra() will return 0.<br>
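<br>As a minimal sketch of the basic formula, assuming no normalization at all: the code below is standalone C++ rather than Xapian API, and the function and parameter names are hypothetical.<br>

```cpp
#include <cassert>
#include <cmath>

// Basic tf-idf weight: wdf * log(N / termfreq).
// Hypothetical standalone helper, not part of Xapian.
//   wdf             - within-document frequency of the term
//   collection_size - N, the number of documents in the collection
//   termfreq        - number of documents indexed by the term
double tfidf_weight(double wdf, double collection_size, double termfreq) {
    assert(collection_size > 0);
    assert(termfreq > 0 && termfreq <= collection_size);
    return wdf * std::log(collection_size / termfreq);
}
```

With the same wdf, a term that indexes few documents gets a larger weight than one that indexes many, and a term that indexes every document contributes 0.<br>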
<br>Moreover, an upper bound on W(t,d) for get_maxpart() can be found easily for a particular normalization (if all the required statistics are available).<br><br>For example, if I am using logarithmic normalization for the wdf (within-document frequency), then an upper bound on W(t,d) will be (log(wdf_upperbound_)+1)*log(N/termfreq), since N (collection size) and termfreq (number of documents indexed by the term t) remain constant for a given term t.<br>
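<br>The bound for the logarithmic case could be sketched roughly as follows. This is standalone C++ with hypothetical names, not the actual Xapian::Weight interface; it assumes the log-normalized wdf is 1 + log(wdf) for wdf > 0 and 0 otherwise, and that an upper bound on wdf for the term is available.<br>

```cpp
#include <cassert>
#include <cmath>

// Log-normalized wdf: 1 + log(wdf) for wdf > 0, otherwise 0.
double log_wdf(double wdf) {
    return wdf > 0 ? 1.0 + std::log(wdf) : 0.0;
}

// idf = log(N / termfreq); constant for a given term.
double idf(double collection_size, double termfreq) {
    assert(collection_size > 0);
    assert(termfreq > 0 && termfreq <= collection_size);
    return std::log(collection_size / termfreq);
}

// Weight of a term in one document (what a sumpart-style method would return).
double weight(double wdf, double collection_size, double termfreq) {
    return log_wdf(wdf) * idf(collection_size, termfreq);
}

// Upper bound over all documents (what a maxpart-style method would return):
// log_wdf is non-decreasing and idf is non-negative, so plugging in the
// upper bound on wdf gives the largest possible weight.
double max_weight(double wdf_upper_bound, double collection_size, double termfreq) {
    return log_wdf(wdf_upper_bound) * idf(collection_size, termfreq);
}
```

The bound holds because only the wdf factor varies from document to document for a given term.<br>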
<br>However, some normalizations for the wdf use the formula wdfn = wdf / max(wdf,d), where max(wdf,d) is the maximum within-document frequency of any term in the document d. This statistic is not provided by the need_stat() function of the Xapian::Weight class, so I don't know how to obtain it. Could someone please help me with that?<br>
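<br>To make clear what the normalization needs, here is a minimal standalone sketch. It assumes the wdf of every term in the document is already available in a map; the TermFreqs type and the function names are hypothetical, and how to get this statistic from inside Xapian::Weight is exactly the open question.<br>

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for one document: term -> wdf.
using TermFreqs = std::map<std::string, unsigned>;

// max(wdf, d): the largest wdf of any term in the document.
unsigned max_wdf(const TermFreqs& doc) {
    unsigned m = 0;
    for (const auto& entry : doc) m = std::max(m, entry.second);
    return m;
}

// wdfn = wdf / max(wdf, d); 0 if the term is absent or the document is empty.
double normalized_wdf(const TermFreqs& doc, const std::string& term) {
    const unsigned m = max_wdf(doc);
    const auto it = doc.find(term);
    if (m == 0 || it == doc.end()) return 0.0;
    assert(it->second <= m);  // holds by definition of the maximum
    return static_cast<double>(it->second) / m;
}
```

The result is always in [0, 1], and the most frequent term in the document gets exactly 1.<br>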
<br>I will work on implementing weight normalization (like cosine normalization) once I am done implementing the scheme with the various wdf and idf normalizations.<br><br>Please let me know what you all think; I want to start working on this. And I'm sorry for being late with modifying the stemmer patch based on the feedback; I have tests going on at university.<br>
<br>-Regards<br>-Aarsh<br> <br>