<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)">I used a subset of INEX 2009 with around 2M documents (some details here: <a href="https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme">https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme</a>) and it worked fine. If you have access to it, it should work for most of our purposes.<br><br></div><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)">Because the INEX documents have rich XML metadata, Letor can benefit from per-field information (title, body, etc.).<br><br></div><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)">For unit testing, as James mentions, go with automated tests in a controlled environment. Use the INEX dataset for explicit evaluation and to check that everything works without breaking at large scale.<br></div><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)"><br></div><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)">Cheers<br></div><div class="gmail_default" style="font-family:monospace,monospace;font-size:small;color:rgb(11,83,148)">Parth <br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, May 14, 2016 at 9:57 PM, James Aylett <span dir="ltr">&lt;<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="">On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:<br>

<br>

> I wanted to decide the dataset that should be used for Letor stabilisation<br>

> project.<br>

<br>

</span>Is this for evaluating the various letor approaches? For unit tests<br>

you'll need to generate your own test data (partly so you can control<br>

it better to do validation properly, but also because the licenses<br>

almost never work).<br>

<br>

Parth should be able to advise on suitable datasets for evaluating<br>

letor.<br>

<span class=""><font color="#888888"><br>

J<br>

<br>

--<br>

  James Aylett, occasional trouble-maker<br>

  <a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>

<br>

</font></span></blockquote></div><br></div></div>