<div dir="ltr">Hi James,<div><br><div><br><div>Thanks for your reply.</div></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-size:14px">I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.</span></blockquote><div> </div><div><br></div><div>I quite agree with you that clear documentation is important. Thanks also for pointing out that omega already supports some logging; I will look through it.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <span style="font-size:14px">but I'd recommend driving your format based on what information you need, rather than on what others have done</span></blockquote><div> </div><div><br></div><div>Yes, I will first consider what information is important for <span style="font-size:14px">relevance judgement. I looked at the </span><span style="font-size:14px">AOL general query logs because I expect them to record the most important fields, which can guide me towards the </span>information I need.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-size:14px">It doesn't have to be a user identifier; it can be a search or session id. 
(This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)</span></blockquote><div><br></div><div> <span style="font-size:14px"> </span></div><div><span style="font-size:14px">By </span><span style="font-size:14px">user identifier I mean the </span><span style="font-size:14px">identifier</span><span style="font-size:14px"> of where the search query comes from, which is effectively the same as </span><span style="font-size:14px">a search or session id, or the IP address the search came from.</span></div><div><span style="font-size:14px"><br></span></div><div><span style="font-size:14px"><br></span></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-size:14px">I'd start this by writing a detailed plan of what you intend to implement.</span></blockquote><div> </div><div><br></div><div>I'm writing a draft proposal and will submit it to the GSoC website soon.</div><div><br></div><div>Thanks</div><div> </div></div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-03-19 19:55 GMT+08:00 James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 13 Mar 2017, at 17:10, YuLun Cai <<a href="mailto:buptcyl@gmail.com">buptcyl@gmail.com</a>> wrote:<br>
<br>
> 1. Where can we get click data? We can extend omega to log the user's searches and clicked documents.<br>
<br>
</span>I think it's also important to have clear documentation for the logging we use, so that non-omega users can benefit from this work. Note that omega already supports some logging, using $log{}.<br>
<span class=""><br>
> 2. The specific click data information and format. Based on some papers and public query dataset formats (AOL search query logs[1] and Sogou Chinese query logs[2]), I think click data should contain: a user identifier such as IP or cookies, a timestamp, the query content, the list of document ids shown, and the clicked document id.<br>
<br>
</span>You need to make a judgement call as to whether training on historical data from the specific Xapian use, or something like the AOL general query logs, is going to produce better results. (I'm guessing that will depend on the method used to generate relevance judgements.)<br>
<br>
If you're just using these as references for coming up with a format, then this isn't an issue (but I'd recommend driving your format based on what information you need, rather than on what others have done).<br>
<span class=""><br>
> The specific format is:<br>
> User identifier \t timestamp \t query contents \t the showing document id list \t<br>
> clicked document id \n<br>
<br>
</span>It doesn't have to be a user identifier; it can be a search or session id. (This can have different privacy implications, although we don't want to try to give recommendations on that side of things.)<br>
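<br>
As a rough illustration of the tab-separated format quoted above, here is a minimal Python sketch of writing and reading one log record. It is not omega's logging code: the names ClickLogEntry, format_entry and parse_entry are hypothetical, and it assumes an integer Unix timestamp, integer document ids, and a comma-separated list for the shown documents, none of which the thread specifies.<br>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClickLogEntry:
    """One record of the proposed click log (field names are hypothetical)."""
    identifier: str           # user, search, or session id
    timestamp: int            # assumption: seconds since the Unix epoch
    query: str
    shown_doc_ids: List[int]  # assumption: integer document ids
    clicked_doc_id: int


def format_entry(entry: ClickLogEntry) -> str:
    """Serialise an entry as one tab-separated, newline-terminated line."""
    # Tabs and newlines delimit the record, so collapse every run of
    # whitespace inside the query to a single space.
    query = " ".join(entry.query.split())
    # Assumption: the shown-document list is joined with commas.
    shown = ",".join(str(doc_id) for doc_id in entry.shown_doc_ids)
    return "\t".join(
        [entry.identifier, str(entry.timestamp), query, shown,
         str(entry.clicked_doc_id)]
    ) + "\n"


def parse_entry(line: str) -> ClickLogEntry:
    """Parse one line produced by format_entry back into an entry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    identifier, timestamp, query, shown, clicked = fields
    return ClickLogEntry(
        identifier=identifier,
        timestamp=int(timestamp),
        query=query,
        shown_doc_ids=[int(d) for d in shown.split(",") if d],
        clicked_doc_id=int(clicked),
    )
```

Because the first field is just an opaque string, the same record works whether it holds a user identifier or a search or session id.<br>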
<br>
> 3. relevance judgements.<br>
<br>
[snip]<br>
<br>
It sounds like you're looking at the right kind of resources for this. Your proposal should be detailed about which route(s) you think are most likely to yield a helpful approach, since this will affect your timeline. (If it makes sense to do some tests of different approaches, that is something you can propose to do during the community bonding period, if you have time, or at the start of the project itself.)<br>
<span class=""><br>
> 4. I do not fully understand 'what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model'. Does it mean that some users want to automatically get additional training data on a regular basis and then update (retrain) the Letor module, so we should provide a sensible workflow and docs for them?<br>
<br>
</span>Yes. Particularly for small sites, or startups who might be evolving rapidly, the detail of what constitutes a good search result may change quite rapidly. In general, this will always change over time even with large and stable sites. So it's important to be able to update the trained Letor model.<br>
<span class=""><br>
> Based on the above understanding, here is my plan about the next period:<br>
> 1. Implement the logging function in omega; the first step is to become familiar with omega and to save search queries and their results successfully.<br>
<br>
</span>I'd start this by writing a detailed plan of what you intend to implement.<br>
<span class="HOEnZb"><font color="#888888"><br>
J<br>
<br>
--<br>
James Aylett, occasional troublemaker & project governance<br>
<a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>
<br>
<br>
<br>
</font></span></blockquote></div><br></div>