GSOC 2017 Project: Learning to Rank Click Data Mining

YuLun Cai buptcyl at gmail.com
Tue Mar 21 05:03:09 GMT 2017


Hi, James


Thanks for your reply.


> I think it's also important to have clear documentation for the logging we
> use, so that non-omega users can benefit from this work. Note that omega
> already supports some logging, using $log{}.



I quite agree with you that clear documentation is important. And thanks
for pointing out that omega already supports some logging; I will look
through it.


> but I'd recommend driving your format based on what information you need,
> rather than on what others have done



Yes, I will first consider what information is important for relevance
judgements. I looked at the AOL general query logs because I think they
record the most important fields, which can guide me to the information I
need.
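
To make this concrete, here is a rough sketch (Python, with placeholder
names; one possible approach rather than a final design) of deriving
graded relevance judgements from such logs by treating click-through rate
as the relevance signal:

    from collections import defaultdict

    def relevance_from_clicks(log_records):
        """Derive graded relevance judgements from click logs.

        log_records: iterable of (query, shown_docids, clicked_docid).
        Returns {query: {docid: click-through rate in [0, 1]}}.
        """
        shown = defaultdict(int)    # (query, docid) -> times shown
        clicked = defaultdict(int)  # (query, docid) -> times clicked
        for query, shown_docids, clicked_docid in log_records:
            for docid in shown_docids:
                shown[(query, docid)] += 1
            if clicked_docid is not None:
                clicked[(query, clicked_docid)] += 1
        judgements = defaultdict(dict)
        for (query, docid), n in shown.items():
            judgements[query][docid] = clicked[(query, docid)] / n
        return judgements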


> It doesn't have to be a user identifier; it can be a search or session id.
> (This can have different privacy implications, although we don't want to
> try to give recommendations on that side of things.)



By "user identifier" I mean the identifier that the search query comes
from, which is effectively the same as a search or session id, or the
searcher's IP.
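
To illustrate the tab-separated record I have in mind (the helper and
field names below are just illustrative, not a fixed API):

    import time

    def format_log_line(session_id, query, shown_docids, clicked_docid):
        r"""Serialise one search event as:
        session id \t timestamp \t query \t shown docid list \t
        clicked docid \n
        """
        return "\t".join([
            session_id,
            str(int(time.time())),
            query,
            ",".join(str(d) for d in shown_docids),
            str(clicked_docid),
        ]) + "\n"

    # e.g. format_log_line("s42", "open source search", [10, 3, 7], 3)
    # gives something like "s42\t1490072589\topen source search\t10,3,7\t3\n"

This matches the format from my earlier mail quoted below.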


> I'd start this by writing a detailed plan of what you intend to implement.



I'm working on a draft proposal and will submit it to the GSoC website
soon.

Thanks


2017-03-19 19:55 GMT+08:00 James Aylett <james-xapian at tartarus.org>:

> On 13 Mar 2017, at 17:10, YuLun Cai <buptcyl at gmail.com> wrote:
>
> > 1. Where we can get the click data: we can extend omega to support
> > logging the user's searches and clicked documents.
>
> I think it's also important to have clear documentation for the logging we
> use, so that non-omega users can benefit from this work. Note that omega
> already supports some logging, using $log{}.
>
> > 2. The specific click data information and format. Based on some papers
> > and public query dataset formats (AOL search query logs[1] and Sogou
> > Chinese query logs[2]), I think click data should contain: a user
> > identifier like IP or cookies, timestamp, query content, the shown
> > document id list, and the clicked document id.
>
> You need to make a judgement call as to whether training on historical
> data from the specific Xapian use, or something like the AOL general query
> logs, is going to produce better results. (I'm guessing that will depend on
> the method used to generate relevance judgements.)
>
> If you're just using these as references for coming up with a format,
> then this isn't an issue (but I'd recommend driving your format based on
> what information you need, rather than on what others have done).
>
> > The specific format is:
> > User identifier \t timestamp \t query contents \t the shown document
> > id list \t clicked document id \n
>
> It doesn't have to be a user identifier; it can be a search or session id.
> (This can have different privacy implications, although we don't want to
> try to give recommendations on that side of things.)
>
> > 3. relevance judgements.
>
> [snip]
>
> It sounds like you're looking at the right kind of resources for this.
> Your proposal should be detailed about which route(s) you think are most
> likely to yield a helpful approach, since this will affect your timeline.
> (If it makes sense to do some tests of different approaches, that is
> something you can propose to do during the community bonding period, if you
> have time, or at the start of the project itself.)
>
> > 4. I do not fully understand 'what a sensible workflow is, for people
> > who want this to be run automatically on a regular basis to update
> > their Letor model'. Does it mean that some users want to automatically
> > get additional training data on a regular basis and then update
> > (retrain) the Letor model, so we should provide a sensible workflow
> > and docs for them?
>
> Yes. Particularly for small sites, or startups who might be evolving
> rapidly, the detail of what constitutes a good search result may change
> quite rapidly. In general, this will always change over time even with
> large and stable sites. So it's important to be able to update the trained
> Letor model.
>
> > Based on the above understanding, here is my plan for the next period:
> > 1. Implement the logging function in omega; the first step is to become
> > familiar with omega and to save the search query and results
> > successfully.
>
> I'd start this by writing a detailed plan of what you intend to implement.
>
> J
>
> --
>  James Aylett, occasional troublemaker & project governance
>  xapian.org
>

