GSoC 2017 Introduction (Weighting Schemes)
Olly Betts
olly at survex.com
Tue Mar 21 04:30:37 GMT 2017
On Mon, Mar 13, 2017 at 03:30:44AM +0530, prachi prakash wrote:
> Thanks for an early reply. I looked a bit deeper into the tf-idf
> implementation and found that the following document length normalizations
> are not implemented [1].
>
> 1) Cosine normalization
> 2) Sum of weights normalization
> 3) Fourth Normalization
> 4) Max weight normalization
There's also the "pivoted unique" normalisation, as linked from the project
idea resources list.
> Since every normalization factor is a constant at the document level, for
> each combination of wdf and idf weighting schemes (that are already
> implemented) the above document normalization factors should be stored in
> the backend (index).
Unless the IDF norm is "none", these norms can't just be factored out of
the equation. For example, consider SMART "bfm" - there we need the maximum
value of 1/n(t) for any term in the query which occurs in the document
being weighted, where n(t) is the number of different documents which term t
occurs in (Xapian calls this "term frequency" but that phrase is sadly
overloaded with multiple meanings in the literature). That's not a
per-document constant factor you can pre-compute.
Cheers,
Olly
More information about the Xapian-devel
mailing list