GSoC 2017: Letor Click Data Mining

Vivek Pal vivekpal.dtu at gmail.com
Wed Mar 22 14:27:25 GMT 2017


Hi James,

> Isn't this from the query template, ie from the main web page of search
> results? (It might make sense from opensearch as well, though.)

Yes, you are right; it is the query template. I said opensearch template
because I haven't yet read all sections of the Omega docs; I'm still working
through them. Thanks for pointing that out.

I'm aiming to cover most of it in a day or two to have a good understanding of
how the project will fit in. However, I won't be able to cover all the
OmegaScript commands, but at least the most relevant ones like $log.

> We need some way of logging when people click on a search result — which
> you can build using a second omegascript template, as Olly suggested.

Okay, so it will act between the query template and the linked document
pointed to by a search result. Do you think we need to make this new template
transparent to the user in some way, as we might have to record some
information such as user IDs in the form of IP addresses? In any case, we'll
need a way to distinguish between different users by assigning unique IDs to
them.

> So the only thing you really need to know is the ENTRY format, so you can
> figure out how to log what you need. (Which you should identify before
> diving into code.)

I see; though wouldn't it be helpful to also have an example of this in the
documentation? There's a DEFAULT_LOG_ENTRY string in query.cc that I came
across while working on the word_in_list PR:

"$or{$env{REMOTE_HOST},$env{REMOTE_ADDR},-}\t"
"[$date{$now,%d/%b/%Y:%H:%M:%S} +0000]\t"
"$if{$cgi{X},add,$if{$cgi{MORELIKE},morelike,query}}\t"
"$dbname\t"
"$query\t"
"$msize$if{$env{HTTP_REFERER},\t$env{HTTP_REFERER}}";

Could you explain the meaning of the third and last strings?
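
To check my understanding of the layout, here is a minimal Python sketch of
how one line written with this format might be split into fields. It assumes
every \t in the string becomes a literal tab, that the query text itself
contains no tab, and that the referer field only appears when HTTP_REFERER is
set. The LogEntry and parse_log_line names are hypothetical, not part of
Omega.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    host: str               # $or{$env{REMOTE_HOST},$env{REMOTE_ADDR},-}
    timestamp: str          # [$date{$now,...} +0000]
    action: str             # "add", "morelike" or "query"
    dbname: str             # $dbname
    query: str              # $query
    msize: str              # $msize, kept as text
    referer: Optional[str]  # only present when HTTP_REFERER is set


def parse_log_line(line: str) -> LogEntry:
    """Split one tab-separated log line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 tab-separated fields, got {len(fields)}")
    referer = fields[6] if len(fields) == 7 else None
    return LogEntry(*fields[:6], referer)


# Made-up example line, not taken from a real log.
sample = "127.0.0.1\t[22/Mar/2017:14:27:25 +0000]\tquery\tdefault\txapian letor\t42"
entry = parse_log_line(sample)
print(entry.action, entry.query, entry.referer)
```

A click log written by the second template would presumably need extra
fields (for example the clicked document and its rank), so this only covers
the existing query log.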

> You need to think more carefully about the layers involved here. We don't
> want to post-process the output of a template...

Yes, so I thought about it in detail and I think the whole process would look
like the following from a broad perspective:

1. Rearrangement: Input the original results to the FairPairs algorithm, which
will rearrange them, and the rearranged results will be presented on the query
template.

2. Logging: Log the required data using a new template and store it in an
appropriate format for further processing.

3. Click Models: These are successors of the preference pair models which I
mentioned earlier. We have some options here as described in the book "Click
Models for Web Search" such as DBN, DCM, CCM etc., which will be trained
on a relevance dataset to provide us with relevance scores of the result links
in our logs, using which we'll generate the Qrel file as used by xapian-letor.
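
To make step 1 concrete, here is a minimal Python sketch of the rearrangement
as I understand the FairPairs idea: pick at random whether pairs start at the
first or the second result, then swap the two results of each adjacent pair
with probability 1/2. The function name fair_pairs and the optional rng
parameter are hypothetical; this is an illustration only, not code from
Xapian or Omega.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def fair_pairs(results: Sequence[T], rng: Optional[random.Random] = None) -> List[T]:
    """Return a copy of `results` with adjacent pairs randomly swapped."""
    rng = rng or random.Random()
    out = list(results)
    # Choose whether pairs start at rank 1 or rank 2 (index 0 or 1).
    start = rng.choice((0, 1))
    for i in range(start, len(out) - 1, 2):
        # Flip each pair with probability 1/2.
        if rng.random() < 0.5:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]
print(fair_pairs(ranked, random.Random(0)))
```

Each result moves by at most one position, so the presented list stays close
to the original ranking. The logging step would also need to record which
pairs were formed, so that clicks can later be interpreted per pair.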

To train a click model, we'd need a relevance prediction dataset that should
contain human-generated binary relevance labels for query-document pairs.
I'm curious to know where we can obtain such a dataset. One that I know
of is the Yandex web search challenge dataset on Kaggle.
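
Independently of which click model we pick, the last part of step 3 (scores to
Qrel) can be sketched in Python. The example below uses a naive click-through
rate with a threshold as a stand-in for a real click model such as DBN, so it
ignores position bias entirely. The "<qid> Q0 <docid> <label>" line layout is
my assumption and should be checked against the xapian-letor documentation;
the qrel_lines name, the threshold value and the sample data are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (query id, document id, clicked?) for one result shown to one user.
Impression = Tuple[str, str, bool]


def qrel_lines(impressions: Iterable[Impression], threshold: float = 0.5) -> List[str]:
    """Turn per-result click-through rates into binary relevance lines."""
    shown: Dict[Tuple[str, str], int] = defaultdict(int)
    clicked: Dict[Tuple[str, str], int] = defaultdict(int)
    for qid, docid, was_clicked in impressions:
        shown[(qid, docid)] += 1
        if was_clicked:
            clicked[(qid, docid)] += 1
    lines = []
    for (qid, docid), n in sorted(shown.items()):
        label = 1 if clicked[(qid, docid)] / n >= threshold else 0
        # Assumed layout: "<qid> Q0 <docid> <label>"; check the xapian-letor docs.
        lines.append(f"{qid} Q0 {docid} {label}")
    return lines


# Made-up impressions for illustration only.
log = [
    ("q1", "docA", True),
    ("q1", "docA", True),
    ("q1", "docB", False),
    ("q1", "docB", True),
    ("q1", "docB", False),
]
print("\n".join(qrel_lines(log)))
```

A trained click model would replace the click-through rate with its own
relevance estimate; the grouping and output steps would stay the same.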

And thanks for the link to the MSet re-ordering system. I'll check out the
ideas that were discussed there.

> That page is ancient, so I hope you're actually installing the 1.4 series
> Xapian and Omega!

The latest stable release is the 1.4 series, but I actually have the 1.5
series installed, which I think is because I installed the dev version from
the latest git master. I don't think that should be a problem here?

> That looks to me like you haven't installed omega, but are trying to run
> with the development version

I have all the Xapian-related executables in /usr/local/bin, including
omindex. Does that suggest Omega is installed?

> When you ran `make install` for omega, it will have copied the CGI somewhere

In /usr/local/lib/xapian-omega/bin, I can't find a CGI, only these files:
mhtml2html, omega, outlookmsg2html, rfc822tohtml and vcard2text.

> More generally, I'd recommend reading the omega documentation.

Yes, I'll go through it. I'll give it a second try after reading the docs and
maybe ask for help with setting up Omega on IRC if I run into an issue again.

Thanks,
Vivek


