GSoC 2016 Letor Stabilisation

James Aylett james-xapian at tartarus.org
Sun Mar 20 14:02:31 GMT 2016


On Sun, Mar 20, 2016 at 05:31:37PM +0530, Ayush Tomar wrote:

> I'm Ayush from New Delhi, India. I am interested in Letor Stabilisation
> project for GSoC. I have a good background in machine learning. Sorry for
> getting in so late, university exams were holding me back. I'll try to
> cover as much as I can in the coming week.

Hi, Ayush. Welcome to Xapian!

> 1. Modifying xapian-letor/bin/questletor.cc to use and test core features
> and API of letor. The current version of questletor.cc has a lot of
> unusable and broken functions and is custom made for training with INEX
> 2010 dataset. The intention is to make it usable for a user provided
> database. Currently I am using xapian-docsprint/data/100-objects-v1.csv as
> my database and some manually written queries and qrels to make things
> work.

That's helpful; I haven't looked at questletor in a while. I'm not
surprised the master version doesn't work, because (as noted in the
project) there's code that we couldn't merge for licensing reasons.

Note that where the project talks about tests, we mean automated
tests, probably unit tests. It's worth looking at how xapian-core does
these, because we'd expect a similar approach for xapian-letor. (I
think you're already clear on that, but I wanted to make sure!)
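
To make "automated unit tests" a bit more concrete, here is a minimal
sketch of the usual split between a small harness and the tests it runs.
It is not xapian-core's harness and uses none of its macros: normalise,
CHECK, TestCase, run_tests and all_tests are hypothetical names made up
for illustration, and normalise merely stands in for whatever letor code
would actually be under test.

```cpp
// Sketch only: a tiny stand-in harness, NOT xapian-core's real one.
#include <cmath>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical code under test: scale non-negative feature values
// into [0, 1] by dividing by the largest value.
std::vector<double> normalise(const std::vector<double>& raw) {
    double max = 0.0;
    for (double v : raw) {
        if (v > max) max = v;
    }
    std::vector<double> out;
    for (double v : raw) {
        out.push_back(max > 0.0 ? v / max : 0.0);
    }
    return out;
}

// ---- Harness: shared by every test, knows nothing about letor ----

// Thrown by CHECK when a condition does not hold.
struct TestFailure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

#define CHECK(cond) \
    do { if (!(cond)) throw TestFailure(#cond); } while (0)

struct TestCase {
    const char* name;
    void (*run)();
};

// Runs every test, logs PASS/FAIL per test, returns the failure count.
int run_tests(const std::vector<TestCase>& tests, std::ostream& log) {
    int failures = 0;
    for (const TestCase& t : tests) {
        try {
            t.run();
            log << "PASS " << t.name << '\n';
        } catch (const TestFailure& e) {
            ++failures;
            log << "FAIL " << t.name << ": " << e.what() << '\n';
        }
    }
    return failures;
}

// ---- Tests: the part that is specific to the code under test ----

static void test_normalise_scales_by_max() {
    std::vector<double> out = normalise({2.0, 4.0, 1.0});
    CHECK(out.size() == 3);
    CHECK(std::fabs(out[0] - 0.5) < 1e-9);
    CHECK(std::fabs(out[1] - 1.0) < 1e-9);
    CHECK(std::fabs(out[2] - 0.25) < 1e-9);
}

static void test_normalise_all_zero() {
    std::vector<double> out = normalise({0.0, 0.0});
    CHECK(out.size() == 2);
    CHECK(out[0] == 0.0 && out[1] == 0.0);
}

const std::vector<TestCase> all_tests = {
    {"normalise_scales_by_max", test_normalise_scales_by_max},
    {"normalise_all_zero", test_normalise_all_zero},
};
```

A driver's main() would only need to return a non-zero exit status when
run_tests(all_tests, std::cout) reports any failures, so that a
"make check" style target can pick the result up automatically.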

> 2. Going through v-hasu's GSoC 2014 code to understand extra
> functionalities added by him and planning how to introduce code from his
> branch.

Good.

> 1. Creating a code example that lets the user use 100-objects-v1.csv as the
> database and use Letor features and API to make queries over it.
> Documenting how to make this example run.

Note again that master probably won't be sufficient to do this. The
missing functionality (i.e. the unmerged work) was rewritten on v-hasu's
(Hanxiao Sun) branch, so it can be pulled from there to form the base.

> 3. Writing API and unit tests

Note that, as the project description states, these should be done
alongside the integration work, rather than considered separately.

> I have some questions:
> 
> 1. Is the procedure I mentioned above the right way to go about it? What
> are the essential portions (in terms of code) that I should complete before
> submitting the proposal?

It's not essential to complete any code ahead of the proposal, and as
you have only a week now to do the proposal that needs to be your
focus. Working with the code, however, is important to understand what
work needs to be done (and so will inform your proposal). So it's not
necessary to be able to submit pull requests yet, but the work you've
been doing in getting familiar with what code is there will form the
basis of your proposal.

> 2. How can I create the test harness for xapian-letor similar to
> xapian-core and start writing tests? Tests seem somewhat overwhelming to me
> at the moment, it would be helpful if I could get some assistance on how to
> go about it.

You'll need to copy the test harness. What I'd do is to copy the whole
of the xapian-core/tests directory, then cut out all the actual
tests. What's left should be the harness and supporting code. (You'll
need to write some more support to 

> 3. How important is writing new features for this project (for instance
> implementing LambdaMART ranking)? Should I focus on them as well in my
> proposal?

Not at all. There's more than enough work in stabilising and
integrating previous work, writing tests and documentation, and
creating a fully-working system suitable for general use. If you were
to integrate all of v-hasu's branch and get that merged, then there's
VcamX's (Jiarong Wei) work to look at from 2014, although that would
require some more planning at the time (I wouldn't plan for that in
your proposal).

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org


