GSoC 2016 Letor dataset discussion
Parth Gupta
pargup8 at gmail.com
Sat May 14 19:09:17 BST 2016
I used a subset of INEX 2009 with around 2M documents (some details here:
https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme)
and it worked fine. If you have access to it, it should work for most of our
purposes.
As the INEX documents have rich XML metadata, Letor can benefit from the
separate fields (title, body, etc.).
For unit testing, as James mentions, go with automated tests in a
controlled environment. Use the INEX dataset for explicit evaluation and to
check that nothing breaks at large scale.
Cheers
Parth
On Sat, May 14, 2016 at 9:57 PM, James Aylett <james-xapian at tartarus.org>
wrote:
> On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:
>
> > I wanted to decide the dataset that should be used for Letor
> > stabilisation project.
>
> Is this for evaluating the various letor approaches? For unit tests
> you'll need to generate your own test data (partly so you can control
> it better to do validation properly, but also because the licenses
> almost never work).
>
> Parth should be able to advise on suitable datasets for evaluating
> letor.
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
>
>