[Xapian-devel] Some questions about Letor project

Wed May 21 19:11:37 BST 2014

Hi all,

Thank you for giving me the opportunity to work with Xapian :) I am Jiarong
Wei, a third-year undergraduate student at Zhejiang University, China. For
GSoC 2014, I will be working on the Letor module with Hanxiao Sun.

Here are some questions I have run into over the past few days:

   1. In letor.cc, the functions fall into two parts: training and ranking.
   Taking SVMRanker as an example, the training part uses the libsvm library
   and the training data to train a model, then saves the model file. The
   ranking part calculates a score for each document in the search results
   (MSet) using the trained model file. My question is: for our three
   rankers (SVMRanker, ListMLE and ListNet), do we need three different
   training parts? (I think the ranking part has the same form for all of
   them.) I’m not sure whether the parameters for the three rankers are the
   same (I guess they differ). As I understand it, letor.cc just passes
   parameters to the ranker, and it is the ranker that actually does the
   training and the scoring. So if we can generalize the form of the training
   part, we won’t need functions like prepare_training_data_for_svm and
   prepare_training_data_for_listwise; a single prepare_training_data would
   do. (The training part could then benefit from ranker inheritance just as
   the ranking part does.)
   2. There is one thing I’d like to confirm: once we have a trained model
   (such as the SVMRanker model file), we generally won’t train that model
   again, right? (The behavior of questletor.cc under bin/ confuses me.)
   3. Since RankList will be removed, as agreed in the meeting last week,
   its related information will be stored under MSet::Internal. My plan is to
   create a new class under MSet::Internal. That class will hold two kinds of
   feature vectors: a normalized one and an unnormalized one. Since it lives
   in MSet::Internal, I think the wrapper class around it will also need to
   provide the corresponding APIs. The rankers will also use MSet instead of
   RankList. Do you have any suggestions for this part?
   4. I think FeatureVector could be dropped, since it only stores the
   feature vector of each document, and that information will be stored in
   the new class in MSet::Internal mentioned in 3.
   5. For Feature (letor_feature.cc), I think it could be a static class. It
   mainly deals with calculating the different features. I’m trying to figure
   out a better way to implement this part. In the meeting last week, Olly
   and Parth suggested using a dispatching function to calculate the
   different kinds of features, because different features, such as
   query-related features and document features, need different parameters.
   With this approach, every calculation method has to be written in the same
   class, which makes it a little hard to add more features. A user who wants
   to use their own feature has to modify our source code, rather than
   implementing their own feature calculation class and having the letor
   module use it. I just think that’s not very convenient for extending
   features. In GSoC 2014 I also need to implement a feature selection
   algorithm, so I think the extensibility of features is fairly important.
   6. FeatureManager will set the context for feature calculation, such as
   the Database, the query and which features we want. It provides basic
   information like term frequency and inverse document frequency. It will
   also have an update_mset function to attach feature information to the
   MSet.
   7. For feature selection, I don’t know when it should be applied. We will
   tell FeatureManager which features we want to use. So will feature
   selection provide information such as “this feature is better, so it gets
   a larger weight”? Or will the algorithm select a subset of the features we
   provide and use that subset to generate the feature vectors?
   8. Do we have any documentation about unit testing? Hanxiao is looking
   for that too.
   9. For automated tests, my idea is to use some data to test the
   functionality of the letor module, covering different configurations such
   as different rankers. I think I need some help with this part. Can
   someone give me some advice?
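
To make point 5 a bit more concrete, here is a rough sketch of the kind of
extensible design I have in mind, as an alternative to a single dispatching
function. Everything in it (DocStats, Feature, FeatureSet and the two example
features) is a hypothetical name I made up for illustration; none of it is
existing Xapian or letor API, and a real version would get its statistics
from FeatureManager rather than from a plain struct.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-document statistics. In a real design these would be
// supplied by FeatureManager from the Database and the query.
struct DocStats {
    double term_freq;   // occurrences of the query terms in the document
    double doc_length;  // length of the document in terms
};

// Hypothetical interface: each feature is its own subclass.
class Feature {
  public:
    virtual ~Feature() {}
    virtual std::string name() const = 0;
    virtual double compute(const DocStats& stats) const = 0;
};

// A feature the letor module could ship with.
class TermFreqFeature : public Feature {
  public:
    std::string name() const override { return "tf"; }
    double compute(const DocStats& stats) const override {
        return stats.term_freq;
    }
};

// A feature a user could add without touching the module's source.
class NormalisedTfFeature : public Feature {
  public:
    std::string name() const override { return "tf/doclen"; }
    double compute(const DocStats& stats) const override {
        // Guard against empty documents to avoid dividing by zero.
        return stats.doc_length > 0 ? stats.term_freq / stats.doc_length : 0.0;
    }
};

// Holds the chosen features and builds one feature vector per document.
class FeatureSet {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    void add(std::unique_ptr<Feature> f) {
        assert(f != nullptr);
        features.push_back(std::move(f));
    }

    // Returns the feature values in the order the features were added.
    std::vector<double> compute_all(const DocStats& stats) const {
        std::vector<double> values;
        values.reserve(features.size());
        for (const auto& f : features) {
            values.push_back(f->compute(stats));
        }
        return values;
    }
};
```

With something like this, a user would register their own subclass through
add() and the module would only ever call compute_all(). A feature selection
step could then, I guess, just decide which Feature objects end up in the
set, but that part is still an open question for me (see 7).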

Thanks for your help :)

Jiarong Wei