[Xapian-devel] Indexing INEX collection for your GSoC Project

Mon May 19 10:58:36 BST 2014

Hi Aarsh,

I see we keep missing each other on IRC, so I am replying to you here.

It would be a good idea for all the GSoC students who need external
datasets for testing and development to use the same collection.

I recommend the INEX collection, which the LTR students will also use. I
am not sure whether you have the correct collection, because I saw you
mention IMDB. The collection I was referring to is the Wikipedia
collection (NOT IMDB), which is available at:
http://www.mpi-inf.mpg.de/departments/d5/software/inex/

Some details are available on the LTR project idea page:
http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:LearningtoRank

To index these XML documents, you can simply treat them as HTML by
passing "--mime-type xml:text/html". This is not strictly the correct way,
but it does the job and gets you started.
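
As a minimal sketch of the mapping above, assuming the indexer is Xapian's
omindex and that it is on your PATH: the helper below only builds the
argument list. The database path "inex.db", the directory
"inex-wikipedia/" and the base URL "/" are hypothetical placeholders, not
values from the collection itself.

```python
from typing import List


def build_omindex_command(db_path: str, collection_dir: str,
                          base_url: str = "/") -> List[str]:
    """Build an omindex argument list that indexes .xml files as HTML.

    db_path, collection_dir and base_url are illustrative placeholders.
    """
    return [
        "omindex",
        "--db", db_path,                  # database to create or update
        "--url", base_url,                # base URL recorded for each document
        "--mime-type", "xml:text/html",   # map the .xml extension to text/html
        collection_dir,                   # directory holding the XML files
    ]


if __name__ == "__main__":
    cmd = build_omindex_command("inex.db", "inex-wikipedia/")
    # Print the command for inspection. To actually run it, pass the list
    # to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell
quoting problems if the paths contain spaces.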

There are also some efficiency notes on my Journal page from GSoC 2011
(see coding week 3): http://trac.xapian.org/wiki/GSoC2011/LTR/Journal

For the queries, you can use the Topics distributed with INEX for the
"Ad-hoc Retrieval Task" (as mentioned on the LTR project idea page).

You can write your own iterator to parse and iterate over the query file.
See the prepare_training_file() method in xapian-letor (
https://github.com/parthg/xapian/blob/master/xapian-letor/letor_internal.cc#L356)
which does that.
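
A rough sketch of such an iterator, in Python rather than the C++ of
xapian-letor: it assumes a plain-text query file with one query per line
in the form <query id> '<query text>'. That layout is an assumption made
for illustration, so adjust the pattern to whatever your query file
actually looks like. The name iter_queries is hypothetical and is not part
of Xapian.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed (hypothetical) line layout: <query id> '<query text>'
_QUERY_LINE = re.compile(r"^\s*(\S+)\s+'(.*)'\s*$")


def iter_queries(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs from an iterable of lines.

    Blank lines are skipped; a line that does not match the assumed
    layout raises ValueError so that format problems show up early.
    """
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        match = _QUERY_LINE.match(line)
        if match is None:
            raise ValueError(f"line {number}: unexpected format: {line!r}")
        yield match.group(1), match.group(2)
```

Since an open file is itself an iterable of lines, you can pass it
directly, e.g. "with open(path) as f: for qid, text in iter_queries(f):".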

If you want to consider a larger query set, you might be interested in the
Million Query Set (http://trec.nist.gov/data/million.query09.html), which
contains 40k web queries. If you need an even larger set, go for the AOL
Query Logs (http://jeffhuang.com/search_query_logs.html), which contain 36M
queries.

Cheers,
Parth.