GSoC 2017 Introduction (Weighting Schemes)

Olly Betts olly at survex.com
Tue Mar 21 04:30:37 GMT 2017


On Mon, Mar 13, 2017 at 03:30:44AM +0530, prachi prakash wrote:
> Thanks for an early reply. I looked a bit deeper into the tf-idf
> implementation and found that the following document length normalizations
> are not implemented [1].
> 
> 1) Cosine normalization
> 2) Sum of weights normalization
> 3) Fourth Normalization
> 4) Max weight normalization

There's also the "pivoted unique" normalisation, as linked from the project
idea resources list.

> Since each normalization factor is a constant at the document level, the
> above document normalization factors should be stored in the backend
> (index) for each combination of wdf and idf weighting scheme (those that
> are already implemented).

Unless the IDF norm is "none", these norms can't just be factored out of
the equation.  For example, consider SMART "bfm" - there we need the maximum
value of 1/n(t) for any term in the query which occurs in the document
being weighted, where n(t) is the number of different documents which term t
occurs in (Xapian calls this "term frequency" but that phrase is sadly
overloaded with multiple meanings in the literature).  That's not a
per-document constant factor you can pre-compute.
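
To make that concrete, here is a rough Python sketch of the "bfm" case as
described above. The toy postings data and the names n() and bfm_max_norm()
are made up for illustration and are not Xapian's API or code. It just shows
that the same document gets a different max-weight norm under two different
queries, so there is no single value to store per document.

```python
# Toy index (hypothetical data): term -> set of ids of documents containing it.
postings = {
    "apple": {1, 2, 3, 4},
    "banana": {1, 2},
    "cherry": {1},
}

def n(term):
    """Number of different documents which the term occurs in."""
    return len(postings[term])

def bfm_max_norm(doc_id, query_terms):
    """Maximum of 1/n(t) over query terms t which occur in the document.

    Illustrative helper following the description above, not Xapian's code.
    """
    weights = [
        1.0 / n(t)
        for t in query_terms
        if doc_id in postings.get(t, ())
    ]
    # No matching query terms means there is nothing to normalise by.
    return max(weights) if weights else 0.0

# Same document, two queries: the norm depends on which query terms match.
norm_q1 = bfm_max_norm(1, ["apple", "banana"])  # max(1/4, 1/2)
norm_q2 = bfm_max_norm(1, ["apple", "cherry"])  # max(1/4, 1/1)
print(norm_q1, norm_q2)
```

Since the norm for document 1 changes with the query, it would have to be
computed at query time (or bounded some other way) rather than read from a
value stored in the index.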

Cheers,
    Olly


