<div dir="ltr">Hi Mayank,<br><br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>
<div><div>Before starting my proposal, I wanted to know what the expected output of the Letor module is. Is it for transfer learning (i.e. you learn from one dataset and leverage it to predict the rankings of another dataset) or is it for supervised learning?<br>
</div><br></div>For instance, Xapian currently powers the Gmane search, which by default is based on the BM25 weighting scheme. Now suppose we want to use LETOR to rank the top k retrieved search results. Taking SVMRanker as an example, will it rank Gmane's search results based on the weights learned from the INEX dataset, given that the client won't be providing any training file? Also, I don't think it will perform well across two datasets with different distributions. So how are we going to use it?<br>
</div></div></blockquote><div><br></div><div>The purpose of xapian-letor is to provide a learning-to-rank system to users who want to perform search. It may sound naive and simple, but that is the actual goal of the letor module. Because Letor is a supervised ranking approach, it requires gold-standard labels, which unsupervised methods like BM25 or TF-IDF do not need. <br>
<br></div><div>From the application point of view, we provide the user with a complete API that can be deployed to rank search results. We do not provide any gold-standard data or document collection. Hence, if the user has a document collection on which she intends to search and has some gold labels for that collection, she is ready to use xapian-letor. We provide a platform that can extract features from documents and create a training collection, learn the ranking function, and then rank unseen queries once the model is trained. <br>
<br></div><div>Of course, it is a "little" hard to obtain gold labels, but research on clickthrough data is providing ways to obtain some automatically. <br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr"><div>
<br></div>PROPOSAL-<br><div><div><div><br>1.Sorting out Letor API will include -<br><ul><li>Implementing SVMRanker and checking its evaluation results against the already generated values.</li></ul><ul><li>Implementing evaluation methods. Those methods will include MAP and NDCG. (<i>Is there any other method in particular that can be implemented other than these two?</i>)</li>
</ul></div></div></div></div></blockquote><div>The most common are these two. While implementing them you will also use precision and recall. <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr"><div><div><div><ul>
</ul><ul><li>Check the performance of ListMLE and ListNet against SVMRanker. (<i>Both ListMLE and ListNet are assumed to have been implemented correctly, but we don't have any tested performance measurements for these two algorithms</i>. <i>Therefore I want to know what the course of action should be for this.</i>)<br>
</li></ul></div></div></div></div></blockquote><div> We need to check how ListMLE and ListNet perform and, if something is wrong, debug them. The best method is to use a common evaluation environment for all three rankers and then check and correct them.<br>
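<br>As a rough illustration of such a common evaluation environment, here is a minimal Python sketch (not the xapian-letor API). The toy rankers, the (features, label) layout of each query, and the use of precision@k as the shared metric are all assumptions made for the example; in practice the three entries would be trained SVMRanker, ListMLE and ListNet models.<br>
<pre>
```python
def precision_at_k(labels, k):
    """Fraction of the top k documents that carry a non-zero label."""
    top = labels[:k]
    return sum(1 for label in top if label) / k if k else 0.0


def evaluate(ranker, queries, k):
    """Mean precision@k of one ranker over all queries.

    `ranker` maps a feature vector to a score; each query is a list of
    (features, label) pairs. Both shapes are assumptions for this sketch.
    """
    total = 0.0
    for docs in queries:
        ranked = sorted(docs, key=lambda d: ranker(d[0]), reverse=True)
        total += precision_at_k([label for _, label in ranked], k)
    return total / len(queries) if queries else 0.0


def compare(rankers, queries, k):
    """Score every ranker on the same queries with the same metric."""
    return {name: evaluate(fn, queries, k) for name, fn in rankers.items()}


# Toy stand-ins for the three rankers; real models would be trained first.
rankers = {
    "first_feature": lambda f: f[0],
    "second_feature": lambda f: f[1],
    "sum": lambda f: sum(f),
}
queries = [
    [([0.9, 0.1], 1), ([0.2, 0.7], 0), ([0.5, 0.6], 1)],
    [([0.1, 0.9], 0), ([0.7, 0.2], 1)],
]
results = compare(rankers, queries, k=1)
```
</pre>
Because every ranker sees the same queries and the same metric, a ranker that scores far below the others is a natural candidate for debugging.<br>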
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><ul><li>Implementing a rank aggregator. I've read about the <b>Kemeny-Young method</b>. Can you provide me with the names of the algorithms that should be implemented here, or those that were proposed the year before last? Also, is there a way to check any ranker's performance (<i>since the INEX dataset doesn't provide rankings</i>)?</li>
</ul></div></div></div></div></blockquote><div>I am not sure whether we should include rank aggregation, but one paper to refer to would be <a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf">http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf</a><br>
<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><ul>
</ul><p>2. Implementing automated tests will include -</p><ul><li>For testing, 20 documents and 5 queries can be picked from the INEX dataset, put to test and checked against their expected outputs.</li></ul><ul><li>Implemented evaluation metrics can also be used to test learning algorithms.</li>
</ul></div></div></div></div></blockquote><div>I think last year Gaurav Arora (IRC nick: samuaelharden) was handling some evaluation, but I am not sure of its state. You can check whether it can be used for letor, in terms of passing Letor::RankList as a parameter and receiving a MAP or NDCG value.<br>
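<br>For reference, the two metrics can be sketched in a few lines of Python. This is only an illustration of the formulas, not the existing evaluation code or the Letor::RankList interface: a ranked list is assumed to be a plain list of relevance labels in ranked order, average precision is normalised by the relevant documents present in that list, and the DCG gain uses the common 2^rel - 1 form.<br>
<pre>
```python
import math


def average_precision(relevances):
    """Average precision for one ranked list of binary labels (1 = relevant).

    Assumes the list holds every judged document for the query, so the
    number of relevant documents is the number of hits in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """MAP: the mean of average precision over all queries."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def dcg(relevances, k):
    """Discounted cumulative gain over the top k graded labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances, k):
    """NDCG@k: DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```
</pre>
Precision shows up directly inside average_precision, which is why precision and recall come along when implementing MAP.<br>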
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><ul>
</ul><p>3. Implementing feature selection algorithms -</p><ul><li>I have a question here. Why are we planning to implement a feature selection algorithm when we have only 19 features? I don't think it'll over-fit the dataset. Also, from what I have learnt, feature selection algorithms (like PCA in classification) are used only for time or space efficiency.</li>
</ul></div></div></div></div></blockquote><div>Feature selection is a utility for anyone who wants to use it. xapian-letor can also operate on data beyond the 19 features currently implemented. These 19 features are the ones we can extract, but if a user already has a training file with 300 features, she should be able to train the letor model on that file, and when she wants to rank a document she should be able to provide a matching feature vector. In between, the feature selection algorithm can help.<br>
<br></div><div>Feature selection algorithms have been shown to significantly outperform the full feature set. See both feature selection references in the resources section on the Project Ideas page.<br><br></div><div>This time we want to make sure that adding more features becomes very easy for anybody. For example, a new feature could be the term frequency of query terms in the URL, which would become the 20th feature. The API should be very flexible for this kind of extension.<br>
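<br>One possible shape for that kind of extensibility is a feature registry, sketched below in Python. This is a hypothetical design, not the current xapian-letor API: the FEATURES list, the feature decorator, the dictionary-based document and both example features are invented for illustration.<br>
<pre>
```python
FEATURES = []  # ordered list of (name, function) pairs


def feature(name):
    """Decorator that appends a feature function to the registry."""
    def register(fn):
        FEATURES.append((name, fn))
        return fn
    return register


@feature("tf_title")
def tf_title(query_terms, doc):
    """Term frequency of the query terms in the title field."""
    words = doc.get("title", "").lower().split()
    return sum(words.count(t) for t in query_terms)


@feature("tf_url")
def tf_url(query_terms, doc):
    """Occurrences of the query terms in the URL (the newly added feature)."""
    url = doc.get("url", "").lower()
    return sum(url.count(t) for t in query_terms)


def feature_vector(query_terms, doc):
    """Evaluate every registered feature, in registration order."""
    return [fn(query_terms, doc) for _, fn in FEATURES]


doc = {"title": "Xapian letor ranking", "url": "http://example.org/xapian/letor"}
vector = feature_vector(["xapian", "letor"], doc)
```
</pre>
With this layout, adding the URL feature means writing one decorated function; the code that builds feature vectors does not change.<br>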
<br></div><div>Cheers,<br></div><div>Parth.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><ul>
</ul><p>Please do provide some feedback so that I can improve upon it.</p><span class=""><font color="#888888"><p>-Mayank<br></p></font></span></div></div></div></div>
<br></blockquote></div><br></div></div>