GSoC 2017: Letor Click Data Mining

James Aylett james at tartarus.org
Wed Mar 22 20:08:38 GMT 2017


On 22 Mar 2017, at 14:27, Vivek Pal <vivekpal.dtu at gmail.com> wrote:

>> We need some way of logging when people click on a search result — which
>> you can build using a second omegascript template, as Olly suggested.
> 
> Okay, so it will act between the query template and the linked document pointed
> to by a search result. Do you think we need to make this new template transparent
> to the user in some way as we might have to record some information such as
> user ids in the form of IP? In any case, we'll need a way to distinguish
> between different users by assigning unique ids to them.

You could do that by identifying the search session rather than the user. That is closer to what we actually need, and it is less likely to run into privacy concerns.
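
As a rough illustration of session-based identification, here is a minimal Python sketch. It assumes the session id is an opaque random token carried from the results page to the click-logging template in a request parameter; the parameter name `SESSION` and both helper functions are hypothetical, not part of Omega.

```python
import secrets


def new_session_id() -> str:
    """Return an opaque random token identifying one search session."""
    # 16 random bytes, URL-safe encoded; carries no information about the user.
    return secrets.token_urlsafe(16)


def ensure_session_id(params: dict) -> str:
    """Reuse the session id from the request parameters, or mint a new one.

    "SESSION" is a hypothetical parameter name chosen for this sketch.
    """
    sid = params.get("SESSION")
    if not sid:
        sid = new_session_id()
    return sid
```

Because the token is random rather than derived from an IP address, two searches can be tied together only while the same token is passed along, which is what a click log needs.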

>> So the only thing you really need to know is the ENTRY format, so you can
>> figure out how to log what you need. (Which you should identify before
>> diving into code.)
> 
> I see; though it would be helpful to also have an example in the documentation
> for the same?

We don't really need an example; however I didn't read the documentation carefully, so it may warrant rewording. Or maybe I should just be more diligent in future.

> There's a DEFAULT_LOG_ENTRY string in query.cc that I came across
> while on the word_in_list PR:
> 
> "$or{$env{REMOTE_HOST},$env{REMOTE_ADDR},-}\t"
> "[$date{$now,%d/%b/%Y:%H:%M:%S} +0000]\t"
> "$if{$cgi{X},add,$if{$cgi{MORELIKE},morelike,query}}\t"
> "$dbname\t"
> "$query\t"
> "$msize$if{$env{HTTP_REFERER},\t$env{HTTP_REFERER}}";
> 
> Could you explain the meaning of the third and last strings?

The third records what sort of query it is: add, morelike or a plain query. The last provides the estimated match size, followed by the HTTP referrer if one was set. Neither is particularly interesting in this case.
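
To make the field layout concrete, here is a small Python sketch that splits one line written in the default format quoted above. The `LogEntry` type and `parse_default_entry` function are names invented for this example, and it assumes the query text itself contains no tab characters.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    host: str               # REMOTE_HOST, REMOTE_ADDR, or "-"
    when: datetime          # timestamp of the request
    kind: str               # "add", "morelike" or "query"
    dbname: str
    query: str
    msize: int              # estimated match size
    referer: Optional[str]  # only present if HTTP_REFERER was set


def parse_default_entry(line: str) -> LogEntry:
    """Parse one tab-separated line in the default log entry layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}")
    host, stamp, kind, dbname, query, msize = fields[:6]
    when = datetime.strptime(stamp, "[%d/%b/%Y:%H:%M:%S +0000]")
    referer = fields[6] if len(fields) == 7 else None
    return LogEntry(host, when, kind, dbname, query, int(msize), referer)
```

A click log would need its own fields (for example the document clicked and its rank), so a parser for it would differ; this only shows how the existing entry breaks down.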

> 3. Click Models: These are successors of preference pair models which I
> mentioned earlier. We have some options here as described in the book "Click
> Models for Web Search" such as DBN, DCM, CCM etc. which will be trained
> on a relevance dataset to provide us with relevance scores of results links in
> our logs using which we'll generate Qrel file as used by xapian-letor.

… and you'll need a way to use letor from omega, or you'll have trained a model for no good reason :)
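
For a sense of the log-to-Qrel step, here is a deliberately simple Python sketch. It is not DBN or any of the other models named above; it uses a plain per-document click-through rate as a stand-in, with an arbitrary threshold. The output layout (`query-id Q0 doc-id relevance`) is assumed to be the TREC-style one and should be checked against the xapian-letor documentation. The function name and its inputs are hypothetical.

```python
from collections import defaultdict


def ctr_qrels(impressions, clicks, threshold=0.5):
    """Turn (query_id, doc_id) impression and click pairs into qrel lines.

    A document is marked relevant (1) for a query when its click-through
    rate reaches `threshold`, otherwise not relevant (0).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for key in impressions:
        shown[key] += 1
    for key in clicks:
        clicked[key] += 1

    lines = []
    for qid, docid in sorted(shown):
        ctr = clicked.get((qid, docid), 0) / shown[(qid, docid)]
        rel = 1 if ctr >= threshold else 0
        # Assumed layout: query-id, literal "Q0", doc-id, relevance.
        lines.append(f"{qid} Q0 {docid} {rel}")
    return lines
```

A real click model would also account for position bias, which raw click-through rate ignores; that is the main reason to prefer one of the models from the book.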

> Latest stable release is 1.4 series but I actually have 1.5 series installed
> which I think is because I installed dev version from latest git master. I
> don't think that should be a problem here?

No, that's even better. I just didn't want you to be using the very old version mentioned in the walkthrough :)

>> That looks to me like you haven't installed omega, but are trying to run
>> with the development version
> 
> I've all xapian related executables in /usr/local/bin including omindex. Does
> that suggest Omega is installed?

Yes. But if you follow the walkthrough, it copies the uninstalled version of the omega CGI.

>> When you ran `make install` for omega, it will have copied the CGI somewhere
> 
> In /usr/local/lib/xapian-omega/bin, I can't find a CGI, only these files:
> mhtml2html, omega, outlookmsg2html, rfc822tohtml and vcard2text.

omega is the CGI (I think).

J

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/
