<div dir="ltr"><div>Hi Jiarong,<br></div><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">
<div><div class="gmail_extra"><div class="gmail_quote"><div class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><p><span></span></p>
<p></p><ol><li>In letor.cc, we have two parts of functions: the training part and the ranking part. I’ll use SVMRanker as an example. The training part basically uses the libsvm library and training data to train a model, then save the model file. The ranking part will calculate score for each document in searching results (MSet) by using the trained model file. My question is for each of our three rankers: 1) SVMRanker 2) ListMLE 3) ListNet, do we need three different types of training part? (The ranking part for each of those have the same form I think) I’m not sure the parameters for these three different rankers are the same or not (I guess they’re different). In my understanding, the letor.cc basically just pass parameters ranker. It’s the ranker will do training and calculating things actually. So if we can generalize the form for training part, we don’t need functions like prepare_training_data_for_svm, prepare_training_data_for_listwise etc. We just need prepare_training_data instead. (We can benefit from inheritance of ranker in training part just like in ranking part)</li>
</ol></div></blockquote></div><div>In general, I think we need a different training part for each ranker. There may be some similarities in these existing rankers, and inheritance would be a sensible way to avoid duplicating code if so, but we'd like to have a framework which we can extend to a completely different type of ranker in future.</div>
</div></div></div></div></blockquote><div><br>We decided that, ideally, there would be only a single method, such as
prepare_training_file, and that it would be each Ranker's responsibility
to interpret the data the way it needs; for example, pairwise
approaches need pairs, and so on. The data format we decided on is the
standard one commonly used in the Letor community. The example below is taken
from the SVM-rank page
(<a href="http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html">http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html</a>). So for
now, please focus on only one method and remove the
others. This should also be communicated to Hanxiao.<br><font color="#000000"><dir>
<tt>&lt;line&gt; .=. &lt;target&gt; qid:&lt;qid&gt;
&lt;feature&gt;:&lt;value&gt; &lt;feature&gt;:&lt;value&gt; ...
&lt;feature&gt;:&lt;value&gt; # &lt;info&gt;</tt><br>
<tt>&lt;target&gt; .=. &lt;float&gt;<br>
&lt;qid&gt; .=. &lt;positive integer&gt;</tt><br>
<tt>&lt;feature&gt; .=. &lt;positive integer&gt;</tt><br>
<tt>&lt;value&gt; .=. &lt;float&gt;</tt><br>
<tt>&lt;info&gt; .=. &lt;string&gt;</tt></dir></font></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra">
<div class="gmail_quote"><div class="">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><ol>
<li>There is one thing I have to confirm: once we have the training model (like model file of SVMRanker), we won’t train that model again in general. (The behavior of questletor.cc under bin/ confuses me)<br></li></ol></div>
</blockquote></div><div>I'm not familiar with the behaviour of questletor, but I suppose it's reasonable to assume that we don't update models after initial creation. It would be nice to be able to do so, but I think many training algorithms aren't updatable. I feel I may be misunderstanding your question here, though. Parth: any comment to add?</div>
</div></div></div></div></blockquote><div><br> questletor is just an example of how the code works.
Once the model is trained, you don't need to retrain it unless you
really want to. To make that clearer, you could add a condition
to questletor so that it trains only when no model file exists. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra">
<div class="gmail_quote"><div class="">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><ol><li>Since RankList will be removed, according to the meeting last week, its related information will be stored under MSet::Internal. My plan is to create new class under MSet::Internal. That class will have two kinds of feature vectors: normalized one and unnormalized one. Since it’s in MSet::Internal, there is a wrapper class outside it I think. So it also needs to provide corresponding APIs in that wrapper class. Also, the ranker will use MSet instead of RankList. Do you have any suggestions for this part?<br>
</li></ol></div></blockquote></div><div>This sounds like a reasonable approach. This sounds like something you could implement very soon, and that is sufficiently standalone we could try and get it merged to master on its own.</div>
</div></div></div></div></blockquote><div><br></div><div>I am not sure you really need to store the normalised feature vector; a method to normalise it should do the job. We should definitely consult Hanxiao if he reaches a point where storing a normalised version would help in some way. Btw, which normalisation methods are you talking about? If you mean QueryLevelNorm (<a href="http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm">http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm</a>), then that is the standard and your feature vector would already be in that form. Do you mean to normalise it further?<br>
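<div>A minimal sketch of normalising on demand instead of storing a second vector. It is written in Python for brevity and is not xapian-letor code: the function name is hypothetical, and it assumes QueryLevelNorm means dividing each feature by its maximum over the documents returned for one query, which is worth checking against the wiki notes.</div>
<pre>
```python
def query_level_normalise(rows):
    """Return a max-normalised copy of the feature rows for ONE query.

    rows: list of equal-length lists of non-negative feature values,
    one inner list per document in the query's result set.
    The input is left untouched, so only the raw values need storing.
    """
    if not rows:
        return []
    # Per-feature maximum across all documents of this query.
    maxima = [max(column) for column in zip(*rows)]
    return [
        # A feature that is zero for every document stays zero.
        [value / m if m > 0 else 0.0 for value, m in zip(row, maxima)]
        for row in rows
    ]
```
</pre>
<div>Because the function returns a copy, a ranker that wants normalised values can ask for them when needed, and nothing extra has to live in MSet::Internal.</div>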
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra"><div class="gmail_quote"><div class="">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><ol><li>For FeatureVector, I think it could be discarded since it just stores the information of feature vector of each document, those information will be stored in the new class in MSet::Internal mentioned in 3.<br>
</li></ol></div></blockquote></div><div>Sounds right to me.</div></div></div></div></div></blockquote><div><br>Okay, sounds fair, but please also store the additional information, such as score and label, that the FeatureVector class currently holds. <br>
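<div>To illustrate what I mean by keeping score and label next to the feature values, here is a rough Python sketch. The class, field and method names are all hypothetical (this is not the existing FeatureVector API); the only part taken from above is the line layout of the training file.</div>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class DocumentFeatures:
    """Hypothetical per-document record kept alongside an MSet entry."""
    docid: int
    features: dict = field(default_factory=dict)  # feature id to value
    score: float = 0.0   # score assigned by the ranker
    label: float = 0.0   # relevance label used for training

    def to_training_line(self, qid):
        """Format the record as one line of the training file."""
        pairs = " ".join(
            f"{fid}:{value}" for fid, value in sorted(self.features.items())
        )
        return f"{self.label} qid:{qid} {pairs} # docid={self.docid}"
```
</pre>
<div>With the label stored in the same record, writing the single training file becomes a loop over the documents of each query.</div>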
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra"><div class="gmail_quote"><div class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr"><ol><li>For FeatureManager, it will set the context for feature calculation, like set Database, set query and what kinds of features we want. It provides some basic information like term frequency and inverse document frequency etc. Also it will have function update_mset to touch feature information to MSet.<br>
</li></ol></div></blockquote></div><div>Again, sounds plausible.</div></div></div></div></div></blockquote><div><br></div><div>Btw, in the end we decided not to categorise features by type (document-dependent, query-dependent, etc.), but we agreed to let the user select a subset of features, maybe in the form of a list&lt;Integer&gt; or something similar. <br>
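<div>Roughly what I have in mind for the user-supplied subset, sketched in Python. The function name is hypothetical, and I am assuming features are identified by the same positive integer IDs used in the training file.</div>
<pre>
```python
def select_features(features, wanted_ids):
    """Keep only the features whose IDs the user asked for.

    features: dict mapping feature id to value for one document.
    wanted_ids: any iterable of integer feature ids.
    """
    wanted = set(wanted_ids)
    return {fid: value for fid, value in features.items() if fid in wanted}
```
</pre>
<div>IDs that the FeatureManager does not know about are simply ignored here; whether that should be an error instead is a design decision we have not made.</div>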
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra"><div class="gmail_quote"><div class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr"><ol><li>For feature selection, I don’t know when to apply this selection. We will provide the features we want to use to FeatureManager. So the feature selection will provide some information like this feature is better so it will have larger weight? Or this algorithm will select subset of features we provide to generate feature vectors?<br>
</li></ol></div></blockquote></div><div>I'd expect the feature selection to select a subset of features: but it's also very good for it to be able to return information that a human can check over to see if it's making plausible decisions.</div>
</div></div></div></div></blockquote><div><br></div><div>Both of the feature selection algorithms mentioned on the Letor ProjectIdea page are based on subset selection. Selection is a one-time step that happens before training. These algorithms give each feature a score indicating how important it is, and the user then selects the top N features based on the heuristics presented in the corresponding paper and the computational power at their disposal.<br>
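<div>The final "pick the top N" step could look something like this Python sketch. It assumes the selection algorithm has already produced one importance score per feature ID; the function name and the tie-breaking rule are my own assumptions, not something taken from the papers.</div>
<pre>
```python
def top_n_features(importance, n):
    """Return the ids of the n highest-scoring features.

    importance: dict mapping feature id to importance score.
    Ties are broken by the lower feature id so the result is stable.
    """
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [fid for fid, _ in ranked[:n]]
```
</pre>
<div>Returning the IDs in ranked order also gives a human something to check over before the subset is used for training.</div>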
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail_extra"><div class="gmail_quote"><div class="">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><ol><li>For automated tests, my idea is to use some data to test the functionality of letor module. It will also cover different configurations, like using different rankers, to test the functionality. I think I need some help for this part. Can someone give me some advice?</li>
</ol><p></p><p><span></span></p></div></blockquote></div><div>I'm not sure what advice you need; Parth - any ideas here?</div></div></div></div></div></blockquote><div> </div><div>The tests for xapian-letor should focus mainly on the features and the rankers. So you can use a small test collection with a few documents and check whether the calculated features are correct, whether the ranking produced by each ranker is acceptable, and so on.<br>
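<div>As a shape for such tests, here is a small Python sketch. Everything in it is a stand-in: the three-document collection, the term-frequency feature and the trivial ranker are hypothetical and only show the two kinds of check (feature values and ranking order), not how the real test suite has to be organised.</div>
<pre>
```python
# Tiny hand-made collection: docid mapped to its list of terms.
TINY_COLLECTION = {
    1: "apple pie with apple sauce".split(),
    2: "banana bread".split(),
    3: "apple tart".split(),
}


def term_frequency(term, doc_terms):
    """Stand-in feature: raw count of the term in the document."""
    return doc_terms.count(term)


def rank(query_term, collection, score):
    """Return docids by descending score, ties broken by docid."""
    scored = [
        (score(query_term, terms), docid)
        for docid, terms in collection.items()
    ]
    return [docid for _, docid in sorted(scored, key=lambda p: (-p[0], p[1]))]


def test_feature_values():
    # Expected values are worked out by hand from the collection.
    expected = {1: 2, 2: 0, 3: 1}
    for docid, terms in TINY_COLLECTION.items():
        assert term_frequency("apple", terms) == expected[docid]


def test_ranking_order():
    # The document with the most matches should come first.
    assert rank("apple", TINY_COLLECTION, term_frequency) == [1, 3, 2]
```
</pre>
<div>The same pattern repeats per ranker: swap in the ranker under test for the scoring function and keep the hand-checked expected order.</div>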
<br></div><div>Cheers,<br></div><div>Parth. <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br></blockquote></div><br></div></div>