GSoC '17: Reintroducing myself

Sat Mar 11 14:52:35 GMT 2017

Hello,

I am Vivek Pal, a senior-year undergraduate student majoring in Software
Engineering at Delhi Technological University in New Delhi, India. Last
year, I had the opportunity to work with Xapian on the "Weighting
Schemes" project as a GSoC student. I look forward to participating in GSoC
this year as well.

I went through the updated project ideas list and found the "Learning to
Rank Click Data Mining" project, which is a new addition this year, really
interesting, and I would like to apply for it. I had a brief exchange with
Olly about it last week, which led me to continue the discussion here.

After reading a couple of papers[1][2] on clickthrough data mining so far,
I have the following thoughts:

1. Raw click data can be obtained from Omega logs. If there's currently no
functionality for that, then the very first step will be to implement a
logging facility in Omega, or maybe even a standalone proxy-log server, to
record the click data.[1] We'd need that logging facility to capture the
following types of information, depending on the mining technique we choose
to employ:

(a) query information, e.g. text and IDs
(b) click-through information, such as doc IDs and timestamps of clicks, as
well as positions of clicked docs in the search result
(c) search results, i.e. the document ranking shown to the user
(d) user information, such as IP or any other identifier to uniquely
identify each user (only if we choose to use a mining technique that also
considers user profiles)
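
To make the list above concrete, here is a rough sketch of what one logged
record might look like. All of the names here (Click, QueryLogEntry,
record_click) are hypothetical and only for illustration; nothing like this
exists in Omega as far as I know.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Click:
    doc_id: int
    position: int     # 1-based rank of the clicked doc in the result list
    timestamp: float  # seconds since the epoch


@dataclass
class QueryLogEntry:
    query_id: str                      # (a) query information
    query_text: str
    ranking: List[int]                 # (c) doc IDs in the order shown
    clicks: List[Click] = field(default_factory=list)  # (b) click-through info
    user_id: Optional[str] = None      # (d) only for user-profile based mining

    def record_click(self, doc_id: int, timestamp: float) -> Click:
        # Derive the position from the ranking that was actually shown.
        position = self.ranking.index(doc_id) + 1
        click = Click(doc_id, position, timestamp)
        self.clicks.append(click)
        return click
```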

2. Once we have the data, we can use it to generate the Query file used for
training the existing letor model. But generating the relevance judgements
for the Qrel file is where most of the effort is going to be concentrated.
One method is to mark the clicked docs in the search results as relevant and
the unclicked docs as not relevant (assuming that a clicked doc is
implicitly relevant). However, that has a drawback: there is position bias
in the ranked list, i.e. higher-ranked docs have a better chance of being
clicked, and a doc further down the search results is unlikely to be clicked
even if it is relevant to the query.
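
The naive clicked/unclicked labelling described above could be sketched as
follows. The function name naive_labels is hypothetical, and the output is
just a doc ID to label mapping, not the actual Qrel file format.

```python
def naive_labels(ranking, clicked):
    """Label every shown doc 1 if it was clicked, else 0.

    This ignores position bias entirely, which is exactly the
    drawback of this approach.
    """
    clicked = set(clicked)
    return {doc_id: int(doc_id in clicked) for doc_id in ranking}
```

For example, with a ranking of docs 10, 20, 30 and a single click on doc 20,
only doc 20 is labelled relevant, even though the user may never have looked
at doc 30.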

Joachims[2] proposed a method based on preference pairs, extracting a
preference relation between pairs of documents in a ranked list of
web docs, and gave five fairly accurate strategies for extracting preference
feedback. I'd like to list at least a couple here:

(a) A clicked doc is preferable to the docs skipped above it.
(b) A clicked doc is preferable to the docs clicked earlier.

These give us a nice idea about generating the relevance judgements.
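
Here is a minimal sketch of how I read strategies (a) and (b). The function
names are my own, and each returned pair is (preferred doc, less preferred
doc).

```python
def click_gt_skip_above(ranking, clicked):
    """Strategy (a): a clicked doc is preferred over unclicked docs above it."""
    clicked_set = set(clicked)
    pairs = []
    for i, doc in enumerate(ranking):
        if doc in clicked_set:
            for above in ranking[:i]:
                if above not in clicked_set:
                    pairs.append((doc, above))
    return pairs


def click_gt_earlier_click(clicks_in_time_order):
    """Strategy (b): a clicked doc is preferred over docs clicked before it."""
    pairs = []
    for i, later in enumerate(clicks_in_time_order):
        for earlier in clicks_in_time_order[:i]:
            if earlier != later:  # ignore repeat clicks on the same doc
                pairs.append((later, earlier))
    return pairs
```

For a ranking of docs 1 to 5 with clicks on 1, 3 and 5, strategy (a) yields
the pairs (3, 2), (5, 2) and (5, 4).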

But position bias may strongly affect the accuracy of pairwise preference
learning, so we need a position-bias-free method for the learning task.
Radlinski and Joachims[3] gave "a simple method to modify the presentation
of search results that provably gives relevance judgements that are
unaffected by presentation bias", called the FairPairs algorithm. The
modified search result is presented to the user, and click data is then
extracted and mined.
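
A rough sketch of FairPairs as I understand it from the paper: pick a random
offset, group adjacent results into pairs, flip each pair with 50%
probability, and later count a click on the lower doc of a pair as a
preference for it over the upper one. The function names and the rng
parameter are my own, and the details should be checked against [3].

```python
import random


def fair_pairs(ranking, rng=random):
    """Return (presented ranking, list of (upper, lower) pairs as presented)."""
    result = list(ranking)
    k = rng.randint(0, 1)  # offset 0 pairs ranks (1,2),(3,4)...; 1 pairs (2,3),(4,5)...
    pairs = []
    for i in range(k, len(result) - 1, 2):
        if rng.random() < 0.5:  # flip each pair independently
            result[i], result[i + 1] = result[i + 1], result[i]
        pairs.append((result[i], result[i + 1]))
    return result, pairs


def preferences_from_clicks(pairs, clicked):
    """A click on the lower doc of a pair means (lower preferred over upper).

    Assumption: clicks on the upper doc of a pair produce no preference.
    """
    clicked = set(clicked)
    return [(lower, upper) for upper, lower in pairs if lower in clicked]
```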

There are also several sequential click models that assume there is no
position bias, but that doesn't sound like a good solution, so I think
it's best to focus on preference pair learning models.

The most fundamental question still remains unanswered for me after going
through all these papers: how are the final binary relevance judgements
assigned to the docs in the search results? I think once we have the
relevance judgements for the Qrel file, we are pretty much done, as the rest
of the work, starting from generating a training file, is handled by letor
itself.

Please let me know what you think.

Thanks,
Vivek Pal

[1] Optimizing Search Engines using Clickthrough Data, Thorsten Joachims.
[2] Accurately Interpreting Clickthrough Data as Implicit Feedback, Thorsten
Joachims et al.
[3] Minimally Invasive Randomization for Collecting Unbiased Preferences
from Clickthrough Logs, Radlinski and Joachims.