GSOC 2016 Project: Weighting Schemes

Sun Mar 6 01:06:17 GMT 2016

On Sat, Mar 05, 2016 at 07:17:40AM +0530, jaideep singh chauhan wrote:

> I went through the list of your proposed projects and the one that
> fascinates me the most is Ranking: Weighting schemes.
> 
> The project proposes incorporating more weighting schemes, so I wanted
> to know what kind of schemes we are looking to incorporate: are they
> other probabilistic weighting schemes similar to BM25, or is there any
> scope for improvements in language modeling? Both of these classes of
> schemes are highly parameter-dependent, so we could also look into some
> non-parametric weighting schemes. Based on the response, I can come up
> with the pros and cons of the various weighting schemes.

Hi, Jaideep -- welcome to Xapian!

In terms of which schemes to implement, we'd want you to propose what
you think is worth adding. You'll note that the project description
talks specifically about further schemes and options from SMART, and
parameter-free DfR.

On the language modelling front, there may be other models worth
considering, although you'd want to point to some recent academic
work that justifies them for IR. Note the comment in Manning et al.
(2008, 12.1.2 p241) that:

| However, most language-modeling work in IR has used unigram language
| models. IR is not the place where you most immediately need complex
| language models

That said, the book does note that there may be value in more
sophisticated models, for phrase and proximity queries in particular,
with some references to recent work (12.5, p252).
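
For anyone unfamiliar with what a unigram model means in practice,
here is a minimal sketch of query-likelihood scoring with Dirichlet
smoothing. It is plain illustrative Python, not Xapian's API or
implementation; the function names, the toy documents and the default
mu=2000 are all assumptions made for the example.

```python
import math
from collections import Counter


def unigram_log_likelihood(query_terms, doc_terms, collection_counts,
                           collection_len, mu=2000.0):
    """Log P(query | document) under a Dirichlet-smoothed unigram model.

    Each query term is scored independently (the unigram assumption),
    so word order and proximity play no part in the result.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term across the whole collection.
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never seen anywhere: skip it (an assumption)
        # Smooth the document estimate towards the collection estimate.
        p = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score


# Toy collection, purely hypothetical.
docs = {
    "d1": "xapian is a search engine library".split(),
    "d2": "weighting schemes rank documents for a search query".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())


def rank(query):
    """Return document ids ordered from best to worst match."""
    q = query.split()
    return sorted(
        docs,
        key=lambda d: unigram_log_likelihood(
            q, docs[d], collection_counts, collection_len),
        reverse=True,
    )
```

Here mu is exactly the kind of tunable parameter the original question
was worried about, which is part of why the parameter-free DfR schemes
are interesting by comparison.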

J

Manning C D, Raghavan P and Schütze H (2008), Introduction to
Information Retrieval, Cambridge University Press. Available online at
<http://nlp.stanford.edu/IR-book/>.

-- 
  James Aylett, occasional trouble-maker
  xapian.org