GSoC 2016 Letor dataset discussion

Sat May 14 19:09:17 BST 2016

I used a subset of INEX 2009 with around 2M documents (some details here:
https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme)
and it worked fine. If you have access to it, it should work for most of our
purposes.

As the INEX documents have rich XML metadata, Letor can benefit from
per-field features (title, body, etc.).

For unit testing, as James mentions, go with automated tests in a
controlled environment. Use the INEX dataset for explicit evaluation and to
check that nothing breaks at large scale.
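
To make the "controlled environment" idea concrete, here is a rough sketch of
generating a tiny synthetic corpus whose relevance labels are known by
construction. Everything in it is an assumption for illustration: the function
name make_synthetic_corpus, the title/body field layout, and the binary
labelling rule are hypothetical and are not part of Xapian or the Letor code.

```python
import random


def make_synthetic_corpus(num_queries=3, docs_per_query=5, seed=42):
    """Build a small, reproducible corpus with known relevance labels.

    Hypothetical helper for illustration only; not part of Xapian.
    """
    rng = random.Random(seed)  # fixed seed keeps the test data deterministic
    filler = ["alpha", "beta", "gamma", "delta"]
    docs = {}
    qrels = {}
    for q in range(num_queries):
        term = f"topic{q}"
        qrels[term] = {}
        for d in range(docs_per_query):
            doc_id = f"q{q}_d{d}"
            # Document d contains the query term exactly d times, plus filler.
            body = [term] * d + [rng.choice(filler) for _ in range(docs_per_query)]
            rng.shuffle(body)
            docs[doc_id] = {"title": f"doc {doc_id}", "body": " ".join(body)}
            # Assumed labelling rule: relevant if the term occurs at all.
            qrels[term][doc_id] = 1 if d > 0 else 0
    return docs, qrels


if __name__ == "__main__":
    docs, qrels = make_synthetic_corpus()
    print(len(docs), "documents,", len(qrels), "queries")
```

Because the data is generated, a unit test can assert exact expectations
(for example, that a document with no occurrence of the query term is labelled
non-relevant), and there are no licensing issues with shipping it.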

Cheers
Parth

On Sat, May 14, 2016 at 9:57 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:
>
> > I wanted to decide the dataset that should be used for Letor
> > stabilisation project.
>
> Is this for evaluating the various letor approaches? For unit tests
> you'll need to generate your own test data (partly so you can control
> it better to do validation properly, but also because the licenses
> almost never work).
>
> Parth should be able to advise on suitable datasets for evaluating
> letor.
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>
>