<div class="gmail_quote">Hello,<div><br></div><div>I would like to work with Orange as part of GSoC 2012(and continue henceforth). Apologies for joining in a bit late- i was waiting to get a proper grasp of things before discussing it here. Currently I am a Masters students in Mathematics with my bachelors in Computer Science[integrated dual degree]. Over the last year and a half, I have worked on a few ML projects and have a couple of publications(including one at an <a href="http://www.acl2011.org/" target="_blank">ACL'11</a> workshop).</div>
<div><br></div><div>Last year at the Machine Learning Summer School [<a href="http://mlss2011.comp.nus.edu.sg/index.php?n=Site.Speakers" target="_blank">MLSS</a>] at NUS, I attended <a href="http://research.microsoft.com/en-us/people/hangli/" target="_blank">Hang Li</a>'s (MSR) tutorial on Learning to Rank, and I have discussed a few things with him (over email) about feature extraction for LTR algorithms. Over the last week I have been following the mailing list discussions here and researching the issues myself. I wanted to discuss a few issues/thoughts:</div>
<div><br></div><div><b>Doubt 1:</b></div><div><br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><b>Feature Extraction/Selection:</b></div><div>The various datasets listed on MSR's LETOR site have a limited set of features, and the current implementation in Xapian's LETOR has 5 features [tf, idf, doc_len, coll_tf, coll_len]. While algorithms for learning ranking models have been studied intensively, the same is not true of feature selection, despite its importance. In a paper presented at SIGIR'07 [a Tier-1 venue in the IR domain], the authors highlight the effectiveness of feature selection methods for ranking tasks [<a href="http://research.microsoft.com/en-us/people/tyliu/fsr.pdf" target="_blank">link</a>]. I believe that, apart from the traditional/clichéd IR features, we should <b>incorporate new features</b> to improve the performance of the LETOR module.</div>
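<div><br></div><div>Just to make the starting point concrete, here is a minimal sketch (plain Python, not Xapian code, with hypothetical names) of how those five traditional features are typically computed for a query-document pair, summing the per-term values over the query terms:</div><div><pre>
import math

def basic_letor_features(query_terms, doc_tf, doc_len,
                         coll_tf, coll_len, num_docs, doc_freq):
    """Sketch of the five basic features: tf, idf, doc_len, coll_tf, coll_len.
    doc_tf, coll_tf and doc_freq are dicts mapping term to count."""
    tf = sum(doc_tf.get(t, 0) for t in query_terms)
    idf = sum(math.log(num_docs / (1.0 + doc_freq.get(t, 0))) for t in query_terms)
    ctf = sum(coll_tf.get(t, 0) for t in query_terms)
    return [tf, idf, doc_len, ctf, coll_len]
</pre></div>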
<div><br></div><div><b>Using unlabeled data:</b></div><div>Over the last 3-4 years, many papers have shown the value of unlabeled data for the task at hand by using it during the feature extraction stage. Andrew Ng's group proposed a Self-Taught Learning framework [ICML'07 <a href="http://ai.stanford.edu/~hllee/icml07-selftaughtlearning.pdf" target="_blank">paper</a>] in which unlabeled data is used to improve performance, and a very recent <a href="http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf" target="_blank">paper at ICML'11</a> used feature learning on unlabeled data to beat the state of the art in sentiment classification.</div>
<div><br></div><div>Combining the above two points, I suggest an approach that uses features learnt from the data in an unsupervised fashion "<b>in addition to</b>" the commonly used features.</div><div><b>Please note:</b> all of this is in addition to the traditional features, and we would finally use <b>listwise/pairwise approaches</b> [ListMLE, et cetera] to train our models on the new set of features. Please let me know if this sounds good.</div>
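<div><br></div><div>As a minimal sketch of what I mean (plain Python with scikit-learn as a stand-in, not the actual implementation): extra features are learnt from unlabeled feature vectors via k-means, appended to the traditional features, and the ranker is trained with the usual pairwise reduction (a simple pairwise stand-in here, rather than ListMLE itself; all names are illustrative):</div><div><pre>
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def learn_unsupervised_features(unlabeled_X, n_clusters=16, seed=0):
    """Fit k-means on unlabeled data; distances to the centroids become
    extra 'learned' features appended to each feature vector."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(unlabeled_X)
    return lambda X: np.hstack([X, km.transform(X)])

def pairwise_training_set(X, relevance, query_ids):
    """Standard pairwise reduction: difference vectors of document pairs
    from the same query, labelled by which document is more relevant."""
    diffs, labels = [], []
    for q in np.unique(query_ids):
        idx = np.where(query_ids == q)[0]
        for i in idx:
            for j in idx:
                if relevance[i] > relevance[j]:
                    diffs.append(X[i] - X[j]); labels.append(1)
                    diffs.append(X[j] - X[i]); labels.append(0)
    return np.array(diffs), np.array(labels)

# usage sketch:
#   augment = learn_unsupervised_features(unlabeled_X)
#   Xp, yp = pairwise_training_set(augment(X_train), relevance, query_ids)
#   ranker = LogisticRegression().fit(Xp, yp)
</pre></div>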
<div><br></div></blockquote><b>Doubt 2:</b><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><b>Rank Aggregation:</b></div><div>Now that Xapian will have more than one Learning to Rank algorithm, we should look into some kind of rank aggregation as well: combining the outputs of the various algorithms to get a final rank ordering of the results. I went through an ECML'07 paper on an unsupervised method for this [<a href="http://l2r.cs.uiuc.edu/~danr/Papers/KlementievRoSm07.pdf" target="_blank">link</a>]. I haven't yet completely understood their approach, but will do so by the end of the day.</div>
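<div><br></div><div>The method in that paper learns the aggregation from the data itself; purely to illustrate where aggregation would plug in, a much simpler Borda-count baseline could look like this (plain Python, hypothetical names):</div><div><pre>
from collections import defaultdict

def borda_aggregate(rankings):
    """Borda count: each ranker awards a document (n - position) points;
    documents are returned sorted by their total points.
    `rankings` is a list of ranked lists of document ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] += n - pos
    return sorted(scores, key=scores.get, reverse=True)

# e.g. borda_aggregate([["d1", "d2", "d3"],
#                       ["d2", "d1", "d3"],
#                       ["d1", "d3", "d2"]])  ->  ["d1", "d2", "d3"]
</pre></div>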
</blockquote><div><div><br></div><div><b>Modularity:</b></div><div>Developing these modules in a modular fashion, so that it is not necessary to use all of them all the time, would be good. Whenever the user feels that, in addition to the basic features, additional features would help, the feature extraction module could be plugged in; the same goes for rank aggregation.</div>
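<div><br></div><div>As a rough sketch of the interface I have in mind (purely hypothetical names, not an existing Xapian API): every optional module exposes the same small interface, and the feature vector is simply the concatenation of whatever modules the user has plugged in:</div><div><pre>
class FeatureSource:
    """One optional feature-extraction module."""
    def extract(self, query, document):
        raise NotImplementedError

class BasicIRFeatures(FeatureSource):
    def extract(self, query, document):
        # placeholder for tf, idf, doc_len, coll_tf, coll_len
        return [0.0] * 5

def feature_vector(query, document, sources):
    """Concatenate the features from the enabled modules only."""
    vec = []
    for source in sources:
        vec.extend(source.extract(query, document))
    return vec
</pre></div>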
<div><br></div><div><b>Relevant Background:</b></div><div>I have worked on a few research-oriented projects in Machine Learning, but most of them involved coding in Matlab/Java. More details about me: [<a href="http://www.rishabhmehrotra.com/index.htm" target="_blank">link</a>].</div>
<div>I have been working on a project on Topic Modeling (using Latent Dirichlet Allocation) for tweets; the code is on Google Code [<a href="http://code.google.com/p/tweettrends/" target="_blank">link</a>]. I am also involved in a college project on building a <b>focused crawler</b> and extending it to something like <a href="http://rtw.ml.cmu.edu/rtw/" target="_blank">NELL</a> (a far-fetched dream as of now :) ) [Google Code <a href="http://code.google.com/p/bits-crawler/source/browse/" target="_blank">link</a>].</div>
<div><br></div><div>Please let me know what you think about the above points [and/or whether I am way off track].</div><div><br></div><div>Best,</div><div>Rishabh.</div></div></div>