Weighting Scheme Project - Doubts

Mon May 2 17:13:05 BST 2016

On Sat, Apr 30, 2016 at 04:59:17PM +0530, Vivek Pal wrote:

> 1. I shall be improving existing weighting schemes, which it seems can be
> done in two ways. The first is to discard and modify the existing functions;
> the second is to retain the existing functions and provide modified
> functions alongside them as added functionality for users. Both methods
> would involve making changes to the existing weighting function source code
> in xapian-core/weight.

I don't think we want to outright replace existing systems, but if we
can introduce the variants using optional parameters that default to
(effectively) 'off' that might be better than distinct ones,
particularly since I suspect (for instance) BM25 and BM25+ will end up
sharing a lot of their code.
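
A minimal sketch of the optional-parameter idea, written in Python for
brevity. The function name and all parameter names are hypothetical, and
the formula is the textbook BM25 term weight rather than a copy of
Xapian's BM25Weight code; the only point is that a `delta` defaulting to
0.0 lets BM25 and BM25+ share one code path.

```python
def bm25_term_weight(wdf, doc_len, avg_doc_len, idf,
                     k1=1.2, b=0.75, delta=0.0):
    """Textbook BM25 term weight; a non-zero delta gives the BM25+ variant.

    Hypothetical signature for illustration only. Assumes the term occurs
    in the document and that avg_doc_len > 0.
    """
    # Length-normalised term-frequency component, shared by both variants.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    tf_component = wdf * (k1 + 1.0) / (wdf + norm)
    # BM25+ adds a constant lower bound to the tf component. The default
    # delta=0.0 switches this off, leaving plain BM25.
    return idf * (tf_component + delta)


plain = bm25_term_weight(3, 100, 100, idf=2.0)             # BM25
plus = bm25_term_weight(3, 100, 100, idf=2.0, delta=1.0)   # BM25+
```

Existing callers that never pass `delta` would see no change in results,
which is what "defaulting to (effectively) off" is meant to achieve.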

> 2. I am not sure about the whole testing process for the weighting
> schemes.

You're aiming to demonstrate that they make the correct calculations
both in normal use and at the various limits, so it's difficult to
give general advice as the limits will depend on the weighting scheme.

But broadly, you need to independently calculate, or independently
verify, the correct outputs for some test sets (you should be able to
use the existing test databases).

One way of looking for edge cases is to think about conditionals,
although in the case of weighting schemes there may also be edge cases
and limits in the calculations that aren't as obvious.
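
As a rough illustration of that testing approach, here is a sketch in
Python. The function under test is a made-up log-tf times log-idf weight
(not any of Xapian's classes), and the expected value is one worked out
by hand rather than produced by the code being tested.

```python
import math


def tfidf_weight(wdf, termfreq, collection_size):
    """Hypothetical function under test: (1 + ln wdf) * ln(N / termfreq)."""
    if wdf == 0:
        # Conditional worth its own test: the term is absent.
        return 0.0
    tf = 1.0 + math.log(wdf)
    idf = math.log(collection_size / termfreq)
    return tf * idf


def run_checks():
    # Normal use, verified independently by hand:
    # wdf = 1 gives tf = 1; N = 100 and termfreq = 10 give idf = ln 10.
    assert math.isclose(tfidf_weight(1, 10, 100), 2.302585092994046,
                        rel_tol=1e-12)
    # Edge case found by reading the conditional: wdf of zero.
    assert tfidf_weight(0, 10, 100) == 0.0
    # Limit in the calculation itself: a term in every document has
    # idf = ln 1 = 0, so the weight must be zero whatever the wdf.
    assert tfidf_weight(5, 100, 100) == 0.0
    # Sanity check: the weight should grow with wdf.
    assert tfidf_weight(2, 10, 100) > tfidf_weight(1, 10, 100)


run_checks()
```

In the real test suite the inputs would come from the existing test
databases instead of literals.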

> 3. Performance evaluation of weighting schemes: I'm thinking of using the
> TREC dataset collection and calculating precision or recall and MAP. Would
> that be the right way to go? This will be done in the second half of the
> coding period, after the implementation (with testing) part is completely
> done.

You should talk to Gaurav about that, in particular looking at the
evaluation work he did previously
(https://github.com/samuelharden/xapian-evaluation). Basically, what
you're saying is the right approach, but a lot of the work should
already be done for you -- TREC indexing and evaluation, and
implementations of a variety of metrics (including the ones you are
proposing).
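
For reference, the metrics mentioned can be sketched as follows. This is
a plain Python illustration using the usual textbook definitions, with
hypothetical function names; it is not the code from the
xapian-evaluation repository. It assumes each ranked list contains no
duplicate document ids.

```python
def precision(ranked, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not ranked:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(ranked)


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries: relevant documents at ranks 1 and 3, then at rank 2.
runs = [(["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d2", "d1"], {"d1"})]
map_score = mean_average_precision(runs)
```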

We may want to take the opportunity to discuss whether parts or all of
this evaluation framework can be moved into the main Xapian repo, and
if there are changes that will make it easier to use for evaluation in
future.

> 4. Implementation of the remaining SMART normalization of the tf-idf
> weighting scheme: earlier on IRC, Olly rightly pointed out that it would be
> tricky and would require making some changes to how the matcher handles
> certain things in Xapian. Along with that, this pull request
> https://github.com/xapian/xapian/pull/81 has some work done on "max"
> normalization. I won't mind working on it if there are no issues with that.

If Nishad doesn't find time to take this forward, it should be fine
for you to pick up and complete this normalisation.

> I'm thinking of putting these as something like "Additional/optional tasks"
> on my project wiki page. I'm inclined to work on it after I have finished
> the proposed project.

Yes, that's a good idea. You might want, at the end of the project, to
transfer any remaining ideas and thoughts either into the bug tracker
or to somewhere on the wiki (where people are more likely to find them
if others want to get involved in future).

> Also, I'll be having semester exams starting from next Monday (May 09,
> 16). So it is likely that I'll pause project work and continue as soon as
> the exams are over.

Good luck with them!

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org