GSOC 2017 Project: Learning to Rank Click Data Mining
James Aylett
james-xapian at tartarus.org
Sun Mar 19 11:55:38 GMT 2017
On 13 Mar 2017, at 17:10, YuLun Cai <buptcyl at gmail.com> wrote:
> 1. Where can we get your click data? We can extend Omega to support logging the user's searches and clicked documents.
I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.
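For non-Omega users, even something as small as the following Python helper would be enough to produce compatible logs. This is only a sketch of the idea, assuming the tab-separated format discussed below; none of the names here are an existing Xapian API.

    import time

    def log_click_event(logfile, session_id, query, shown_docids, clicked_docid):
        # One event per line: identifier, timestamp, query, displayed
        # docids, clicked docid -- matching the format proposed below.
        fields = [session_id,
                  str(int(time.time())),
                  query,
                  ",".join(shown_docids),
                  clicked_docid]
        with open(logfile, "a", encoding="utf-8") as f:
            f.write("\t".join(fields) + "\n")

(A real implementation would need to escape tabs and newlines inside the query, which this sketch ignores.)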
> 2. The specific click data information and format. Based on some papers and public query dataset formats (AOL search query logs [1] and Sogou Chinese query logs [2]), I think click data should contain: a user identifier (such as an IP or cookie), a timestamp, the query contents, the list of displayed document IDs, and the clicked document ID.
You need to make a judgement call as to whether training on historical data from the specific Xapian installation, or on something like the AOL general query logs, is going to produce better results. (I'm guessing that will depend on the method used to generate relevance judgements.)
If you're just using these as references for coming up with a format, then this isn't an issue (but I'd recommend driving your format from the information you need, rather than from what others have done).
> The specific format is:
> user identifier \t timestamp \t query contents \t displayed document ID list \t
> clicked document ID \n
It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)
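As a concrete illustration, here's a minimal Python parser for that line format. It's a sketch under the assumptions above (comma-separated docid list, one click per line), not a settled format.

    from collections import namedtuple

    ClickRecord = namedtuple(
        "ClickRecord",
        "session_id timestamp query shown_docids clicked_docid")

    def parse_click_log(path):
        # Yield one ClickRecord per tab-separated line of the log.
        with open(path, encoding="utf-8") as f:
            for line in f:
                ident, ts, query, shown, clicked = \
                    line.rstrip("\n").split("\t")
                yield ClickRecord(ident, int(ts), query,
                                  shown.split(","), clicked)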
> 3. relevance judgements.
[snip]
It sounds like you're looking at the right kind of resources for this. Your proposal should be detailed about which route(s) you think are most likely to yield a helpful approach, since this will affect your timeline. (If it makes sense to do some tests of different approaches, that is something you can propose to do during the community bonding period, if you have time, or at the start of the project itself.)
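To make that concrete, one of the simplest routes is to bucket per-(query, document) click-through rate into graded labels. The sketch below builds on the ClickRecord parser above; the thresholds are arbitrary placeholders, and the literature has better-behaved estimators (position-bias correction, click models, etc.) that the proposal could compare against.

    from collections import defaultdict

    def judgements_from_clicks(records):
        # Count how often each (query, docid) pair was displayed
        # and how often it was clicked.
        shown = defaultdict(int)
        clicked = defaultdict(int)
        for r in records:
            for docid in r.shown_docids:
                shown[(r.query, docid)] += 1
            clicked[(r.query, r.clicked_docid)] += 1
        # Bucket click-through rate into graded relevance labels.
        labels = {}
        for key, n in shown.items():
            ctr = clicked[key] / n
            labels[key] = 2 if ctr >= 0.5 else 1 if ctr >= 0.1 else 0
        return labels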
> 4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does this mean that some users want to automatically gather additional training data on a regular basis and then update (retrain) the Letor model, so we should provide a sensible workflow and docs for them?
Yes. Particularly for small sites, or startups who might be evolving rapidly, the detail of what constitutes a good search result may change quite rapidly. In general, this will always change over time even with large and stable sites. So it's important to be able to update the trained Letor model.
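In practice that might be no more than a nightly cron job along these lines; train_and_save() below is a placeholder for whatever training entry point xapian-letor ends up exposing, not an existing API.

    import os, shutil

    def train_and_save(labels, model_path):
        raise NotImplementedError("placeholder for the letor training step")

    def retrain(logfile, model_path):
        # Rebuild judgements from the accumulated click log (using the
        # earlier sketches) and swap in a freshly trained model.
        labels = judgements_from_clicks(parse_click_log(logfile))
        if os.path.exists(model_path):
            shutil.copy(model_path, model_path + ".bak")  # rollback copy
        train_and_save(labels, model_path)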
> Based on the above understanding, here is my plan for the next period:
> 1. Implement the logging functionality in Omega: the first step is to become familiar with Omega and to log search queries and their results successfully.
I'd start this by writing a detailed plan of what you intend to implement.
J
--
James Aylett, occasional troublemaker & project governance
xapian.org