[Xapian-devel] [GSoC2012] Learning to Rank: few thoughts/issues

Sun Apr 1 12:30:53 BST 2012

Hello,

I would like to work with Orange as part of GSoC 2012(and continue
henceforth). Apologies for joining in a bit late- i was waiting to get a
proper grasp of things before discussing it here. Currently I am a Masters
students in Mathematics with my bachelors in Computer Science[integrated
dual degree]. Over the last year and a half, I have worked on a few ML
projects and have a couple of publications(including one at an
ACL'11<http://www.acl2011.org/>workshop).

Last year at Machine Learning Summer
School[MLSS<http://mlss2011.comp.nus.edu.sg/index.php?n=Site.Speakers>]
at NUS, I attended Hang
Li<http://research.microsoft.com/en-us/people/hangli/>(MSR)'s
tutorial on Learning to Rank. I have discussed a few things with him(over
mail) about feature extraction for LTR algorithms. Over the last week I
have been following the mailing list discussions here and researching a bit
about the issues myself. I wanted to discuss about a few issues/thoughts:

*Doubt1:*

*Feature Extraction/Selection:*
The various datasets listed on MSR's LETOR have a limited set of features.
Current implementation in xapian's LETOR has 5
features[tf,idf,doc_len,coll_tf,coll_len]. While algorithms for learning
ranking models have been intensively studied, this is not the case for
feature selection, despite of its importance. In a paper presented at
SIGIR'07 [Tier1 in IR domain], the authors have highlighted the
effectiveness of feature selection methods for ranking
tasks.[link<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf>]
I believe that apart from the traditional/cliched IR features, we
should*incorporate new features
* to improve the performance of the LETOR module.

*Using unlabeled data:*
Over the last 3-4 years a lot of papers have identified the importance of
using unlabeled data to assist the task at hand by using it during feature
extraction stage. Andrew Ng proposed a Self-Taught learning
framework[ICML'07
paper<http://ai.stanford.edu/~hllee/icml07-selftaughtlearning.pdf>]
wherein they make use of unlabeled data to improve performance. A very
recent paper at
ICML'11<http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf>used
the advantage of feature learning using unlabeled data and beat the
state-of-the-art in sentiment classification.

Combining the above two points, I suggest an approach which uses features
learnt from data in an unsupervised fashion "*in addition to*" the commonly
used features.
*Please note:* all this is in addition to the traditional features and
finally we would be using *listwise/pairwise approaches*[ListMLE, et
cetera] to train our models on the new set of features. Please let me know
if this sounds good.

*Doubt2:*

*Rank Aggregation:*
Now that Xappian will have >1 Learning to rank algorithms, we should look
into some kind of rank aggregation as well: combining outputs by various
algorithms to get a final rank ordering for results. I went though a
ECML'07 paper on unsupervised method for the
same[link<http://l2r.cs.uiuc.edu/~danr/Papers/KlementievRoSm07.pdf>].
I haven't yet completely understood their approach but will do so by the
end of day.

*Modularity:*
Developing such modules in a modular fashion such that its not necessary to
use all of them all the times, would be good. Whenever the user feels that
in addition to basic features, he/she could use additional features, the
feature extraction module could be plugged in. Same for rank aggregation.

*Relevant Background:*
I have worked on few research oriented projects in Machine Learning, but
most of them involved coding in Matlab/Java. More details about me:
[link<http://www.rishabhmehrotra.com/index.htm>
].
I have been working on a project on Topic Modeling(using Latent Dirichlet
Allocation) for Tweets. Link <http://code.google.com/p/tweettrends/> of the
code on Google code. Also, I am involved in a collage project on
building *focused
crawler *& extending it to something like NELL
<http://rtw.ml.cmu.edu/rtw/><far-fetched
dream as of now :) >.[Google code
link<http://code.google.com/p/bits-crawler/source/browse/>
]

Please let me know how you feel about the above mentioned points [and/or if
I am way off the track].

Best,
Rishabh.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120401/dec53ec2/attachment.htm>