GSoC 2017 Introduction (Weighting Schemes)
prachi prakash
prachiprakash80 at gmail.com
Sun Mar 12 22:00:44 GMT 2017
Hi Olly,
Thanks for the early reply. I looked a bit deeper into the tf-idf
implementation and found that the following document length normalizations
are not implemented [1]:
1) Cosine normalization
2) Sum of weights normalization
3) Fourth normalization
4) Max weight normalization
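
To make the four factors concrete, here is a minimal sketch of how they could be computed for one document. It assumes the SMART-style definitions (cosine = 1/sqrt(sum of w^2), sum = 1/sum of w, fourth = 1/sum of w^4, max = 1/max of w); these should be checked against [1] before relying on them. The function name and the plain list of per-term weights are hypothetical and not part of Xapian.

```python
import math


def normalization_factors(weights):
    """Return document-level normalization factors for a list of term weights.

    `weights` holds the (already computed) wdf * idf weight of every term in
    one document. The formulas are the assumed SMART-style definitions.
    """
    if not weights or max(weights) <= 0:
        raise ValueError("need at least one positive term weight")
    return {
        "cosine": 1.0 / math.sqrt(sum(w * w for w in weights)),
        "sum": 1.0 / sum(weights),
        "fourth": 1.0 / sum(w ** 4 for w in weights),
        "max": 1.0 / max(weights),
    }
```

Each factor depends only on the document's own term weights, which is why it can be computed once per document.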
Since each normalization factor is constant at the document level, the
factors above should be stored in the backend (index) for each combination
of the wdf and idf weighting schemes that are already implemented.
Furthermore, multiplying by the document normalization factor while
weighting each term seems redundant. Could we have an abstract function
such as get_mulextra in the Weight class that returns a term-independent
document normalization factor? The summed weight of the document for the
query would then be multiplied by this factor once to get the final weight
(rank) of the document for that query.
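
The idea can be illustrated with a small sketch. It is not Xapian code: the class, its constructor, and the `(wdf_weight, idf_weight)` pairs are hypothetical, and `get_mulextra` is only the name proposed in this mail. It shows that applying the document factor once to the summed weight gives the same score as multiplying it into every term.

```python
class NormalizedScorer:
    """Hypothetical scorer; illustrates the proposal, not an existing API."""

    def __init__(self, doc_norm_factor):
        # Term-independent factor, assumed to be read from the index.
        self.doc_norm_factor = doc_norm_factor

    def get_sumpart(self, wdf_weight, idf_weight):
        # Unnormalized contribution of one matching term.
        return wdf_weight * idf_weight

    def get_mulextra(self):
        # Proposed hook: factor applied once per document.
        return self.doc_norm_factor


def score_once(scorer, matching_terms):
    """Sum the term weights, then multiply by the document factor once."""
    total = sum(scorer.get_sumpart(wdf, idf) for wdf, idf in matching_terms)
    return total * scorer.get_mulextra()


def score_per_term(scorer, matching_terms):
    """Multiply the document factor into every term (the redundant way)."""
    return sum(
        scorer.get_sumpart(wdf, idf) * scorer.get_mulextra()
        for wdf, idf in matching_terms
    )
```

Both functions agree up to floating-point rounding, but `score_once` performs one multiplication by the factor per document instead of one per matching term.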
Please let me know whether I am thinking in the right direction.
References:
[1] Nicola Polettini. The Vector Space model in Information Retrieval - Term
Weighting Problem. January 2004.
Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity
On Sun, Mar 5, 2017 at 8:41 PM, prachi prakash <prachiprakash80 at gmail.com>
wrote:
> Hello Everyone,
>
> I am a second year graduate student at IIIT-Bangalore and my interest is
> in the field of Information Retrieval. I have successfully compiled Xapian
> from source and have implemented some examples. While going through the
> project list, the Weighting Schemes project is the one I would like to
> contribute to. So I went through xapian-core/weight, where most of the
> schemes are already present, and I also went through the Bigram-model,
> which is outside the tree and not merged yet.
>
> So can anyone please give me a pointer to which weighting schemes are not
> implemented yet, so that I can start looking at them?
>
> Regards,
> Prachi Prakash
> Final year Graduate Student
> LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
> github: https://github.com/PrachiPrakash?tab=activity
>