GSoC 2017 Introduction (Weighting Schemes)

prachi prakash prachiprakash80 at gmail.com
Sun Mar 12 22:00:44 GMT 2017


Hi Olly,

Thanks for the early reply. I looked a bit deeper into the tf-idf
implementation and found that the following document length normalizations
are not implemented [1]:

1) Cosine normalization
2) Sum of weights normalization
3) Fourth normalization
4) Max weight normalization
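
To make the four factors concrete, here is a minimal sketch of how each
could be computed from the unnormalized term weights of one document. The
helper names are hypothetical and not part of Xapian, and the definitions
(in particular the fourth normalization as 1 / sum of w^4) are my reading of
[1], so they should be checked against the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helpers, not part of Xapian. `w` holds the unnormalized
// wdf * idf weight of every term in a single document.

// Cosine normalization: 1 / sqrt(sum of w_i^2).
double cosine_norm(const std::vector<double>& w) {
    double sum_sq = 0.0;
    for (double x : w) sum_sq += x * x;
    assert(sum_sq > 0.0);
    return 1.0 / std::sqrt(sum_sq);
}

// Sum of weights normalization: 1 / sum of w_i.
double sum_norm(const std::vector<double>& w) {
    double sum = 0.0;
    for (double x : w) sum += x;
    assert(sum > 0.0);
    return 1.0 / sum;
}

// Fourth normalization: 1 / sum of w_i^4 (assumed definition).
double fourth_norm(const std::vector<double>& w) {
    double sum_4 = 0.0;
    for (double x : w) sum_4 += x * x * x * x;
    assert(sum_4 > 0.0);
    return 1.0 / sum_4;
}

// Max weight normalization: 1 / max of w_i.
double max_weight_norm(const std::vector<double>& w) {
    assert(!w.empty());
    double max_w = *std::max_element(w.begin(), w.end());
    assert(max_w > 0.0);
    return 1.0 / max_w;
}
```

Each function needs the weights of all terms in the document, which is why
the result cannot be derived from a single posting at query time.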

Each of these normalization factors is constant at the document level, so
for each combination of wdf and idf weighting schemes (which are already
implemented), the above document normalization factors should be stored in
the backend (index).

Furthermore, I was thinking that multiplying by the document normalization
factor while weighting each term is redundant. Could we have an abstract
function like get_mulextra in the Weight class which returns a
term-independent document normalization factor? That factor would be
multiplied once with the document's summed weight for the query to get the
final weight (rank) of the document for that query.
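
A rough sketch of the idea, assuming the factor has already been read from
the index. SketchWeight, TermStats, term_weight and score_document are
hypothetical names used only for illustration; this is not Xapian's actual
Weight interface, and get_mulextra is only the name proposed above.

```cpp
#include <cassert>
#include <vector>

// Per-term statistics for one matching query term (hypothetical type).
struct TermStats {
    double wdf;
    double idf;
};

// Hypothetical stand-in for a weighting scheme, NOT Xapian::Weight.
struct SketchWeight {
    virtual ~SketchWeight() = default;
    // Unnormalized contribution of one matching term.
    virtual double term_weight(const TermStats& t) const = 0;
    // Proposed hook: term-independent document normalization factor.
    virtual double get_mulextra(double stored_norm) const = 0;
};

struct TfIdfSketch : SketchWeight {
    double term_weight(const TermStats& t) const override {
        return t.wdf * t.idf;
    }
    double get_mulextra(double stored_norm) const override {
        return stored_norm;
    }
};

// Sum the per-term weights, then apply the document factor once.
double score_document(const SketchWeight& wt,
                      const std::vector<TermStats>& matching,
                      double stored_norm) {
    assert(stored_norm > 0.0);
    double sum = 0.0;
    for (const TermStats& t : matching) sum += wt.term_weight(t);
    return sum * wt.get_mulextra(stored_norm);
}
```

Since the factor n is the same for every term of a document, the sum of
w_i * n equals n times the sum of w_i, so one multiplication per document
gives the same score as one multiplication per term.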

Please let me know whether I am thinking in the right direction.

References:
[1] Nicola Polettini. The Vector Space Model in Information Retrieval - Term
Weighting Problem. January 2004.

Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity


On Sun, Mar 5, 2017 at 8:41 PM, prachi prakash <prachiprakash80 at gmail.com>
wrote:

> Hello Everyone,
>
> I am a second year graduate student at IIIT-Bangalore and my interest is
> in the field of Information Retrieval. I have successfully compiled Xapian
> from source and have implemented some examples. While going through the
> project list, the Weighting Schemes project is the one I would like to
> contribute to. So I went through xapian-core/weight, where most of the
> schemes are already present, and I also went through the Bigram model, which
> is outside the tree and not merged yet.
>
> Could anyone please give me a pointer to which weighting schemes are not
> implemented yet so that I can start looking at them?
>
> Regards,
> Prachi Prakash
> Final year Graduate Student
> LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
> github: https://github.com/PrachiPrakash?tab=activity
>

