[Xapian-devel] New Idea on Ranking in IR

Parth Gupta parthg.88 at gmail.com
Tue Apr 5 07:59:16 BST 2011

Your comment on training data from 'click-through' certainly makes sense.
But I am not sure what sources we can use to produce such data as of now;
suggestions are welcome. The main idea would be to build the infrastructure
for learning, learn the ranking function from the currently available
training data, and re-learn the ranking function as and when better data
comes along or is generated.
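
As a rough illustration of starting from the currently available data, here
is a minimal Python sketch that collapses TREC-style relevance judgements
into binary labels. It assumes the usual four-column qrels layout (topic,
iteration, document id, relevance grade). The function name and the sample
document ids are made up for illustration, and nothing here is Xapian API.

```python
from collections import defaultdict


def load_binary_qrels(lines):
    """Parse TREC-style qrels lines into {topic: {docno: 0 or 1}}.

    Assumed line layout: topic iteration docno grade.
    Any grade above 0 is collapsed to 1 (relevant); everything else is 0.
    """
    labels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, grade = parts
        labels[topic][docno] = 1 if int(grade) > 0 else 0
    return dict(labels)


# Hypothetical sample judgements; the document ids are placeholders.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "302 0 DOC-C 2",
]
qrels = load_binary_qrels(sample)
```

Each (topic, document) pair labelled this way would then be joined with
whatever feature vector is computed for that pair. Re-learning later only
needs the same function run over a newer judgement file.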

If some of our features are time-dependent then we have to retrain at some
interval anyway. So it should not be a problem to incorporate
changes in order to facilitate new and better training data to the

Okay, so I will submit the application keeping this feedback in mind. If,
after reviewing the entire application, anything needs more clarification
(within the deadline, of course), then please let me know.


On Mon, Apr 4, 2011 at 8:16 AM, Olly Betts <olly at survex.com> wrote:

> On Sun, Apr 03, 2011 at 08:57:36PM +0530, Parth Gupta wrote:
> > Click-through measurements are certainly a good measure for automatic
> > preparation of training data. But what I have in mind is that, if we
> > consider relevance as a binary variable, then for the training data there
> > are many relevance judgements available for the ad-hoc retrieval task from
> > good IR conferences like TREC or FIRE, so we can prepare the feature
> > vectors from them. It will be a first benchmark for the project guideline.
> > It will be reliable too because it is human-judged and comprises both
> > relevant and non-relevant documents. So it is an unbiased sample and good
> > for machine learning.
>
> That's OK for developing this.  But it seems likely that training in one
> domain won't transfer reliably to another, so someone developing a
> search which uses this will really need their own training data.
>
> So for a developer wanting to deploy this, being able to automatically
> crowd-source my training data by tracking clicks on search results is
> much more appealing than having to invest time and/or money in getting
> relevance judgements produced specially.  Click data also allows training
> to be a more continuous process, which is beneficial for sites where
> topics evolve fairly quickly with time (like news sites).
>
> The click data is almost certainly going to be noisier, which might be
> an issue for training, but for a busy site you can easily produce much
> more of it than you can with explicit relevance judgements, so perhaps
> the noise can be filtered out if it is an issue.
>
> > Also I am very new to the formalities to submit the application for the
> > GSoC so if the things happen early then I would have enough time to shape
> > the application considering feedbacks.
>
> The formalities are that you need to file an application here before
> 1900UTC on April 8th:
> http://socghop.appspot.com/gsoc/org/google/gsoc2011/xapian
> But it's a good idea to get your application in sooner than that to
> give us a chance to review it and make comments.  There's also likely to
> be a surge in proposals as the deadline nears.  You're able to make
> changes up until the deadline.
> Cheers,
>     Olly