GSOC 2017 Project: Learning to Rank Click Data Mining

YuLun Cai buptcyl at gmail.com
Mon Mar 13 17:10:37 GMT 2017


I am interested in the project 'Learning to Rank Click Data Mining', and
here is my current understanding of this project:
1. Where we can get the click data: we can extend Omega to support logging
the user's searches and clicked documents.
2. The specific click data information and format: based on some papers and
public query dataset formats (the AOL search query logs [1] and the Sogou
Chinese query logs [2]), I think a click record should contain: a user
identifier such as an IP address or cookie, a timestamp, the query
contents, the list of shown document ids, and the clicked document id. The
specific format is:
User identifier \t timestamp \t query contents \t shown document id list \t
clicked document id \n
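To make the proposed format concrete, here is a minimal parsing sketch (the
field layout follows the line above; the exact format is still only a
proposal, and the ClickRecord name is just mine):

```python
# Sketch of parsing one line of the proposed tab-separated click log.
# Fields: user id, timestamp, query, comma-separated shown docids, clicked docid.
from dataclasses import dataclass

@dataclass
class ClickRecord:
    user_id: str            # IP address or cookie
    timestamp: str
    query: str
    shown_docids: list      # document ids shown, in rank order
    clicked_docid: str

def parse_click_line(line):
    """Split one log line on tabs; the shown-docid list is comma-separated."""
    user_id, timestamp, query, shown, clicked = line.rstrip("\n").split("\t")
    return ClickRecord(user_id, timestamp, query, shown.split(","), clicked)

record = parse_click_line(
    "1.2.3.4\t2017-03-13T17:10:37\txapian letor\tdoc3,doc1,doc7\tdoc1")
```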
3. Relevance judgements: I have read some papers about relevance judgement
by click models. Specifically, [3] uses a dynamic Bayesian network which
considers the result set as a whole and takes into account the influence of
the other URLs while estimating the relevance of a given URL from click
logs, effectively reducing position bias (URLs appearing in lower positions
are less likely to be clicked even if they are relevant).
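To illustrate the position-bias idea, here is a toy sketch using a simple
position-based examination model, which is a simplification of the DBN in
[3], not the full model; the EXAMINE probabilities are made-up assumptions:

```python
# Toy position-based debiasing: weight each impression by an assumed
# probability that the user actually examined that rank, so documents shown
# lower are not penalised just for being shown lower.
EXAMINE = [0.9, 0.6, 0.3]  # assumed examination probability for ranks 0, 1, 2

def estimate_relevance(sessions):
    """sessions: list of (shown_docids, clicked_docid) for one query.
    Returns clicks divided by expected examinations for each document."""
    clicks = {}
    examinations = {}
    for shown, clicked in sessions:
        for rank, doc in enumerate(shown):
            examinations[doc] = examinations.get(doc, 0.0) + EXAMINE[rank]
            if doc == clicked:
                clicks[doc] = clicks.get(doc, 0) + 1
    return {doc: clicks.get(doc, 0) / examinations[doc] for doc in examinations}
```

With two sessions where "A" (always rank 0) and "C" (always rank 2) each get
one click, "C" scores higher, because a click at a rarely-examined position
is stronger evidence of relevance.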
   [4] proposes efficient discriminative parameter estimation in a
multiple-instance learning (MIL) framework to automatically produce true
relevance labels for <query, URL> pairs. The basic idea of MIL is that
during training, instances are presented in bags, and labels are provided
for the bags rather than for individual instances. If a bag is labeled
positive, it is assumed to contain at least one positive instance; a
negative bag means that all instances in the bag are negative. From a
collection of labeled bags, the classifier tries to figure out which
instance in a positive bag is the most “correct”.
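To show the bag/instance structure, here is a toy sketch; the naive scoring
(count appearances in positive bags, subtract appearances in negative bags)
is only for illustration and is not the discriminative estimator of [4]:

```python
# Toy MIL labeling: labels are per bag, and we try to identify which
# instance in each positive bag is most likely responsible for the label.
def score_instances(bags):
    """bags: list of (instances, label), label 1 = positive, 0 = negative.
    Score each instance by (positive-bag appearances - negative-bag ones)."""
    score = {}
    for instances, label in bags:
        for inst in set(instances):
            score[inst] = score.get(inst, 0) + (1 if label else -1)
    return score

def pick_positive(bags):
    """For each positive bag, pick its highest-scoring instance."""
    score = score_instances(bags)
    return [max(insts, key=lambda i: score[i])
            for insts, label in bags if label]
```

For example, an instance that appears in every positive bag but no negative
bag gets the highest score and is picked as the "correct" one in each bag.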
4. I do not fully understand 'what a sensible workflow is, for people who
want this to be run automatically on a regular basis to update their Letor
model'. Does it mean that some users want to automatically collect
additional training data on a regular basis and then update (retrain) the
Letor module, so we should provide them with a sensible workflow and
documentation?
Based on the above understanding, here is my plan for the next period:
1. Implement the logging functionality in Omega; the first step is to
become familiar with Omega and successfully save search queries and
results.
2. Read more papers about click models and, based on them, put forward an
effective way to judge relevance.
Looking forward to your opinions, and please correct me if I am wrong.
Thanks!

references:
[1]
http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs
[2] http://www.sogou.com/labs/resource/q.php
[3] Chapelle O., Zhang Y. A dynamic Bayesian network click model for web
search ranking. In: International Conference on World Wide Web. ACM,
2009:1-10.
[4] Song H., Miao C., Shen Z. Generating true relevance labels in Chinese
search engine using clickthrough data. In: AAAI Conference on Artificial
Intelligence. AAAI Press, 2011:1230-1236.

