<div dir="ltr">Hi Jiarong, and welcome.<div><br></div><div>For future reference (both for you and for our other GSoC students), it's best not to batch up communications, but to ask individual questions like these as they come up.  I can often respond to a short email straight away, whereas it has taken me a while to find time to sit down and respond to this one.</div>


<div><br></div><div>Also, don't forget to update <a href="http://trac.xapian.org/wiki/GSoC2014/Learning%20to%20Rank%20Jiarong%20Wei/Journal">http://trac.xapian.org/wiki/GSoC2014/Learning%20to%20Rank%20Jiarong%20Wei/Journal</a> each day to say how you're getting on: I'm checking it daily but have seen no updates yet.  Remember that we can only help you based on what you tell us, and what code you push.  Don't be reluctant to push work-in-progress code to GitHub; it's often easier to discuss a problem around code you've tried writing, even if that code doesn't work or is only a sketch of an idea.</div>


<div><br></div><div>Try and be present on IRC when you're working; asking questions as they come up there can be helpful.</div><div><div class="gmail_extra"><br><div class="gmail_quote">On 21 May 2014 19:11, Jiarong Wei <span dir="ltr">&lt;<a href="mailto:vcamx3@gmail.com" target="_blank">vcamx3@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">


<p>Here are some questions I have run into over the past few days:<br></p><p><span></span></p><p></p><ol><li>In letor.cc, the functions fall into two parts: the training part and the ranking part. I’ll use SVMRanker as an example. The training part uses the libsvm library and the training data to train a model, then saves the model file. The ranking part uses the trained model file to calculate a score for each document in the search results (MSet). My question is: for each of our three rankers (SVMRanker, ListMLE and ListNet), do we need a different training part? (I think the ranking part has the same form for all of them.) I’m not sure whether the parameters for these three rankers are the same (I guess they’re different). As I understand it, letor.cc just passes parameters to the ranker, and it is the ranker that actually does the training and the calculation. So if we can generalize the form of the training part, we don’t need functions like prepare_training_data_for_svm, prepare_training_data_for_listwise, etc.; we just need prepare_training_data instead. (The training part could then benefit from ranker inheritance, just as the ranking part does.)</li>


</ol></div></blockquote><div>In general, I think we need a different training part for each ranker.  There may be some similarities in these existing rankers, and inheritance would be a sensible way to avoid duplicating code if so, but we'd like to have a framework which we can extend to a completely different type of ranker in future.</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><ol start="2">


<li>There is one thing I’d like to confirm: once we have a trained model (such as the SVMRanker model file), we generally won’t train that model again. Is that right? (The behavior of questletor.cc under bin/ confuses me.)<br></li></ol></div>


</blockquote><div>I'm not familiar with the behaviour of questletor, but I suppose it's reasonable to assume that we don't update models after initial creation.  It would be nice to be able to do so, but I think many training algorithms aren't updatable.  I feel I may be misunderstanding your question here, though.  Parth: any comment to add?</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><ol start="3"><li>According to last week’s meeting, RankList will be removed and its information will be stored under MSet::Internal. My plan is to create a new class under MSet::Internal that holds two kinds of feature vector: a normalized one and an unnormalized one. Since it’s in MSet::Internal, I think there is a wrapper class outside it, so the corresponding APIs also need to be provided in that wrapper class. The ranker will then use MSet instead of RankList. Do you have any suggestions for this part?<br>


</li></ol></div></blockquote><div>This sounds like a reasonable approach.  It also sounds like something you could implement very soon, and it is sufficiently standalone that we could try to get it merged to master on its own.</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><ol start="4"><li>I think FeatureVector could be discarded, since it just stores the feature vector of each document; that information will be stored in the new class in MSet::Internal mentioned in 3.<br>


</li></ol></div></blockquote><div>Sounds right to me.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div dir="ltr"><ol start="5">


<li>For Feature (letor_feature.cc), I think it could be a static class. It mainly focuses on calculating the different features. I’m trying to figure out a better way to implement this part. In last week’s meeting, Olly and Parth suggested using a dispatching function to calculate the different kinds of features, because different features (such as query-related features and document features) need different parameters. With this method, we would have to write every calculation method in the same class, which makes it a little hard to extend with more features. A user who wants to add their own feature would need to modify our source code, rather than implementing their own feature calculation class and having the letor module use it. I just think that is not a convenient way to extend features. In GSoC 2014 I also need to implement a feature selection algorithm, so I think the extensibility of features is quite important.<br>


</li></ol></div></blockquote><div>I can't remember the details of this but what you're suggesting sounds on the right lines.  We certainly want to design for easy extensibility.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div dir="ltr"><ol start="6"><li>FeatureManager will set the context for feature calculation, such as the Database, the query and the kinds of features we want. It provides some basic information such as term frequency and inverse document frequency. It will also have a function update_mset to attach feature information to the MSet.<br>


</li></ol></div></blockquote><div>Again, sounds plausible.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<div dir="ltr"><ol start="7"><li>For feature selection, I don’t know when the selection should be applied. We will tell FeatureManager which features we want to use. So will feature selection provide information such as “this feature is better, so it gets a larger weight”? Or will the algorithm select a subset of the features we provide and use it to generate the feature vectors?<br>


</li></ol></div></blockquote><div>I'd expect feature selection to select a subset of features, but it would also be very good for it to return information that a human can check over to see whether it's making plausible decisions.</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><ol start="8"><li>Do we have any documentation about unit tests? Hanxiao is looking for that too.<br>


</li></ol></div></blockquote><div>We don't have many unit tests; there is xapian-core/tests/internaltest.cc, which runs some tests that could be considered unit tests.  Mostly, our tests are what might be considered integration tests (i.e., apitest).  The tests were set up before many of the modern testing conventions became commonplace; it would be interesting to have a wider discussion about how we could make it easier to implement unit tests.</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><ol start="9"><li>For automated tests, my idea is to use some data to test the functionality of the letor module. The tests would also cover different configurations, such as different rankers. I think I need some help with this part. Can someone give me some advice?</li>


</ol><p></p><p><span></span></p></div></blockquote><div>I'm not sure what advice you need; Parth - any ideas here?</div></div></div></div></div>