Logging the click data

James Aylett james at tartarus.org
Sun Jun 4 15:05:47 BST 2017


On 3 Jun 2017, at 22:08, Vivek Pal <vivekpal.dtu at gmail.com> wrote:

> This helped me look at things associated with logging the click data from
> a better perspective. As already documented on the project's wiki page,
> we need the following fields in separate columns:
> 
> 1. ID: some identifier for each entry
> 2. QUERY: text of the query
> 3. URLs: list of the URLs of the documents displayed on the result page
> 4. CLICKS: list of clicks where each element is the number of times the
> corresponding URL was clicked

That shouldn't be the logging format, for reasons I'll get into shortly. That's an intermediate view which you'll need to generate from the logging, which will enable you to create the input files for letor training.

> It seems more natural to me to implement a secondary log command and trigger
> it every time a new query is entered into the query template. It would create
> a log file with the above columns/fields i.e. a unique identifier for each
> log entry, entered query text, list of documents URLs displayed, a list
> of the number of times the corresponding URL was clicked (all the elements
> in this list will be initialised to 0, as clicks haven't occurred yet).
> 
> Once we have the log file, all we need to do is update the fourth column with
> click information whenever a click happens by looking for the correct entry
> in the file (e.g. by matching the query text) and updating the list in the
> fourth column accordingly.
> 
> Does this entire idea sound workable?

The problem is "all we need to do is update the fourth column". Updating is hard, in the sense that every thread processing web requests has to be able to update the same view (which means you're basically implementing some sort of database, and have to consider concurrent access and updating). Easier would likely be to log each individual click, and then provide something that can process those "raw" files into the intermediate format you need for your click model ahead of letor training.

I think this makes your job at this point easier, because now you're looking to emit a pair of files which you can later roll up into the above format (which is a fairly simple aggregation step). One would contain entries as follows, with a new entry for each executed search (a minimal sketch of emitting such an entry follows the field list):

ID: some identifier for each query
QUERY: text of the query (when the query is run)
URLs: every URL displayed (or alternatively, the Xapian docid — this might be easier)
OFFSET: the rank of the first result displayed; without this you'll have difficulty coping with result pages other than the first (when this happens, the query ID should probably remain the same, and when you aggregate you can "glue" the different pages together)
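
Purely as a sketch of the shape of this (none of it is existing Omega code; the tab-separated format, the helper name log_search(), and the UUID-based ID scheme are all assumptions for illustration):

    import uuid

    SEARCH_LOG = "search.log"  # assumed location

    def log_search(query, docids, offset):
        """Append one search entry: ID, QUERY, docids (or URLs), OFFSET."""
        qid = uuid.uuid4().hex  # hypothetical ID scheme; any unique token works
        # Note: a real format would need to escape tabs in the query text.
        with open(SEARCH_LOG, "a") as f:
            f.write("%s\t%s\t%s\t%d\n" %
                    (qid, query, "|".join(str(d) for d in docids), offset))
        return qid  # passed through to the result template so click links carry it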

The other file would contain the clicks, so for each URL clicked in a result page, emit the following (sketched after the list):

ID: the query identifier that matches the entry in the search log
URL: the URL redirected to (again, or the Xapian docid)
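
The click side is even simpler under the same assumptions (hypothetical helper log_click(), same tab-separated style):

    CLICK_LOG = "click.log"  # assumed location

    def log_click(qid, docid):
        """Append one click entry: the query ID plus the docid (or URL) clicked."""
        with open(CLICK_LOG, "a") as f:
            f.write("%s\t%s\n" % (qid, docid))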

This means you need to be able to generate an ID for each query, and also that each clickable URL in the results page will need to go via the omega CGI using a different template whose job it is to log ID & URL to the click log and then redirect to URL. Once generated, the ID can be passed through from call to call (including on pagination). (We'll need a new ID when the query is changed, in the same way that we reset the page offset, which works by considering xP, xDB, and xFILTERS.)
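
For illustration only, the log-and-redirect step boils down to something like this minimal CGI sketch (in practice it would be an Omega template rather than Python, and the parameter names ID and URL are assumptions):

    #!/usr/bin/env python3
    import cgi

    form = cgi.FieldStorage()
    qid = form.getfirst("ID", "")
    url = form.getfirst("URL", "")
    log_click(qid, url)  # the click logger sketched above
    # Then hand the browser on to the real destination.
    print("Status: 302 Found")
    print("Location: %s" % url)
    print()

A real implementation would also want to check that URL was actually one of the results shown for that ID, so this doesn't become an open redirect.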

If you record the Xapian docid rather than the URL, it's both more compact and easier to serialise for the search entries (eg something like: ID,docid1|docid2|docid3|…,OFFSET,QUERY). It can also cope with multiple documents that lead to the same URL but with different metadata. The downside is that if someone updates the Xapian document to point to something totally different, it's difficult to analyse across that boundary, and so the letor model will be less helpful. (Of course, if it's for an internal search system, the same is true of URLs if someone changes the content significantly. So it's not a huge downside, and can simply be documented.)
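
To show that the aggregation really is simple, here's a sketch that rolls the two logs up into the ID/QUERY/URLs/CLICKS view, gluing pages of the same query together by OFFSET (again assuming the tab-separated formats from the sketches above):

    from collections import Counter

    def aggregate(search_log, click_log):
        """Yield (ID, QUERY, docids, clicks) rows built from the two raw logs."""
        clicks = Counter()
        with open(click_log) as f:
            for line in f:
                qid, docid = line.rstrip("\n").split("\t")
                clicks[(qid, docid)] += 1
        queries = {}
        with open(search_log) as f:
            for line in f:
                qid, query, docids, offset = line.rstrip("\n").split("\t")
                # Collect each result page under its query ID.
                queries.setdefault(qid, (query, []))[1].append(
                    (int(offset), docids.split("|")))
        for qid, (query, pages) in queries.items():
            # Glue the pages together in display order.
            docids = [d for _, page in sorted(pages) for d in page]
            yield qid, query, docids, [clicks[(qid, d)] for d in docids]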

J

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/



