GSoC 2016 Letor dataset discussion

Ayush Tomar ayushtomar at gmail.com
Sat May 14 12:21:57 BST 2016


Hello,

I would like to decide which dataset should be used for the Letor
stabilisation project.

I think the 2009 INEX Wikipedia Collection
<http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/software/inex/>
should work fine. It contains 2,666,190 XML articles, 115 topics
<http://inex.mmci.uni-saarland.de/protected/adhoc/2009-topics.zip>, and 50,275
qrel <http://inex.mmci.uni-saarland.de/protected/adhoc/2009-inex_eval.zip>
labels, and has an uncompressed size of 50.75 GB (5.52 GB compressed).

A similar alternative is the 2013 INEX Wikipedia LOD Collection
<http://inex-lod.mpi-inf.mpg.de/2013/>. It contains 12,216,109 XML
articles, 144 topics
<http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-topics.xml>,
and 14,400 qrel
<http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-qrels.zip>
labels. It has a compressed size of 11.12 GB. The INEX 2009 Collection is a
subset of it.
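
For a rough comparison of how densely each collection is judged, the figures
quoted above can be turned into an average number of qrel labels per topic. This
is only a sketch in Python using the numbers from this mail; the dictionary
layout and the helper name labels_per_topic are illustrative assumptions, not
part of any INEX or Xapian tooling.

```python
# Figures as quoted in this mail; the structure and names are illustrative only.
collections = {
    "INEX 2009": {"articles": 2_666_190, "topics": 115, "qrels": 50_275},
    "INEX 2013 LOD": {"articles": 12_216_109, "topics": 144, "qrels": 14_400},
}


def labels_per_topic(stats):
    """Return the average number of qrel labels per topic."""
    return stats["qrels"] / stats["topics"]


for name, stats in collections.items():
    print(f"{name}: {labels_per_topic(stats):.1f} qrel labels per topic")
# INEX 2009: 437.2 qrel labels per topic
# INEX 2013 LOD: 100.0 qrel labels per topic
```

By this measure the 2009 collection has roughly four times as many labels per
topic, although the number of labels alone says nothing about their quality.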

If there are any more recent or better datasets that could be used, please
let me know.

Thanks,
Ayush

More information about the Xapian-devel mailing list