<div dir="ltr"><div>Hi, I have some questions about implementing weight normalization schemes (e.g. cosine normalization) in TfIdf. To implement these, the weights of all the terms in a document are needed. If the score of each document is available separately, it can be normalized by dividing it by the square root of the sum of the squared TfIdf weights of the individual terms in that document.</div><div>Reference: <a href="http://www.ics.uci.edu/~djp3/classes/2008_09_26_CS221/Lectures/Lecture26.pdf">http://www.ics.uci.edu/~djp3/classes/2008_09_26_CS221/Lectures/Lecture26.pdf</a></div><div><br></div><div>In the Xapian code I searched for such a list of scores of the documents that contain query terms, but couldn't find one. I didn't completely understand how MultiMatch::get_mset(), which produces the list of relevant items, works. It would be great if someone could explain the workings of this method in some detail, and how the scores of individual documents can be retrieved so that the normalization can be computed on each of them. I have read <a href="http://xapian.org/docs/matcherdesign.html">http://xapian.org/docs/matcherdesign.html</a>, but I did not understand exactly how the matcher functions. The details of the matching process will be needed to implement this normalization.</div><div><br></div><div>In the current TfIdf weighting scheme's get_sumpart() method, get_wtn() is called just before the final weight is returned. The proposed normalization is supposed to be implemented there. But I don't think it is possible to calculate the normalized weight at that point, as we need the weights contributed by every term in this particular document. It would probably be costly to calculate each term's weight in this method. Hence, as I mentioned above, it would be a good idea to carry out this normalization after the initial document scores have been calculated.</div></div>
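<div>To make the normalization I have in mind concrete, here is a minimal sketch in plain Python. It is not Xapian code: the function names <code>cosine_norm</code> and <code>normalized_score</code> and the example weights are made up for illustration, and I am assuming that the per-term TfIdf weights of a document are already available as a dictionary.</div>

```python
import math


def cosine_norm(term_weights):
    """Euclidean length of a document's term-weight vector.

    term_weights maps every term in the document (not only the query
    terms) to its TfIdf weight. Hypothetical input, not a Xapian structure.
    """
    return math.sqrt(sum(w * w for w in term_weights.values()))


def normalized_score(term_weights, query_terms):
    """Raw document score divided by the document's cosine norm."""
    # Raw score: sum of the weights of the query terms present in the document.
    raw = sum(term_weights.get(t, 0.0) for t in query_terms)
    norm = cosine_norm(term_weights)
    # Guard against an empty or all-zero document vector.
    return raw / norm if norm > 0.0 else 0.0


# Illustrative weights only; real values would come from the weighting scheme.
doc_weights = {"xapian": 3.0, "weight": 4.0}
score = normalized_score(doc_weights, ["xapian"])
```

<div>The point of the sketch is that the divisor depends on every term in the document, while the raw score only needs the query terms, which is why I think the division has to happen after the per-term contributions have been summed.</div>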