GSoC '17: Reintroducing myself

Sun Mar 19 05:12:32 GMT 2017

Generally this all sounds sensible to me.  A few comments below:

On Sat, Mar 11, 2017 at 08:22:35PM +0530, Vivek Pal wrote:
> 1. Raw click data can be obtained from Omega logs. If there's currently no
> functionality for that then the very first step will be to implement a
> logging facility in Omega or maybe even a standalone proxy-log server to
> record the click data.[1] We'd need different functionalities in that
> logging facility to extract the following type of information depending
> upon the mining technique we choose to employ:

There's a $log{} command available in Omega templates.  We can't log clicks
from the result page template, as that is evaluated before any click can
happen, but we could make result links redirect via a second Omega template
which does the logging.
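
To make the redirect idea concrete, here's a rough sketch in Python of the
data flow (not Omega template code).  The parameter names "q", "rank" and
"url", the redirect path, and the tab-separated log format are all made up
for illustration; none of them are existing Omega interfaces.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def make_redirect_link(redirect_base, query, rank, target_url):
    """Build a result link which goes via the (hypothetical) logging redirect."""
    # urlencode() escapes the target URL so it survives as a single parameter.
    params = urlencode({"q": query, "rank": rank, "url": target_url})
    return redirect_base + "?" + params


def handle_click(link):
    """Return (log_line, target_url) for a redirect link which was followed."""
    params = parse_qs(urlsplit(link).query)
    query = params["q"][0]
    rank = int(params["rank"][0])
    target = params["url"][0]
    # Assumed log format: query, rank and clicked URL, tab-separated.
    log_line = "\t".join([query, str(rank), target])
    return log_line, target


link = make_redirect_link("/cgi-bin/omega/click", "xapian letor", 3,
                          "https://example.org/doc?id=7")
log_line, target = handle_click(link)
```

A real version would also want to check the target really is one of our
results, so the redirect can't be abused to bounce people to arbitrary URLs.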

> But position bias may strongly affect the accuracy of pairwise preference
> learning so we need a position bias free method for the learning task.
> Radlinski and Joachims [3] gave "a simple method to modify the presentation
> of search results that provably gives relevance judgements that are
> unaffected by presentation bias" called simple FairPairs algorithm. The
> modified search result is presented to the user and click data is extracted
> and mined thereafter.

So for that you'd also need to implement this result modification, and then
use that new feature from Omega.
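
For reference, my understanding of the result modification is roughly the
following (a Python sketch based on my reading of the FairPairs description
in [3], so please check it against the paper; the function names are made up
and this isn't existing Xapian or Omega code).  Adjacent results are paired
starting at a randomly chosen offset of 0 or 1, each pair is swapped with
probability 1/2, and a click on the lower result of a pair is taken as a
preference for it over the result shown just above it.

```python
import random


def fairpairs_present(ranking, rng=random):
    """Return (presented, pairs) for one showing of the results.

    pairs lists (upper, lower) documents in the order actually presented.
    """
    k = rng.randint(0, 1)          # pairing offset: 0 or 1
    presented = list(ranking)
    pairs = []
    for i in range(k, len(presented) - 1, 2):
        if rng.random() < 0.5:     # swap each pair with probability 1/2
            presented[i], presented[i + 1] = presented[i + 1], presented[i]
        pairs.append((presented[i], presented[i + 1]))
    return presented, pairs


def preferences_from_clicks(pairs, clicked):
    """Return (preferred, other) for each pair whose lower document was clicked."""
    clicked = set(clicked)
    return [(lower, upper) for upper, lower in pairs if lower in clicked]


presented, pairs = fairpairs_present(["d1", "d2", "d3", "d4", "d5"])
prefs = preferences_from_clicks(pairs, clicked=[presented[1]])
```

The list of pairs would need to be logged (or be reconstructible) alongside
the clicks, since the preferences can't be recovered from the clicks alone.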

> And, there are also several sequential click models that use the hypothesis
> that there is no position bias but that doesn't sound like a good solution
> so I think it's best to focus on preference pair learning models.

That indeed seems an unrealistic assumption, though I guess what really
matters is how effective these models are in practice (after all, models
are almost inherently simplifications of reality).

> The most fundamental question still remains unanswered for me after going
> through all these papers: how are the final binary relevance judgements
> assigned to the docs in the search results? I think once we have the
> relevance judgements for the Qrel file, we are pretty much done, as the rest
> of the work, starting from generating a training file, is handled by letor
> itself.

Yes, that seems the appropriate boundary with the existing xapian-letor
module.

Cheers,
    Olly