[Xapian-devel] Introduction and Discussion for Learning to Rank Framework

Parth Gupta parthg.88 at gmail.com
Tue Jun 7 07:54:12 BST 2011


Hello All,

As part of this year's GSoC project we are working on a weighting scheme called
"Learning to Rank" (LTR), which involves machine learning: it is a supervised
ranking scheme, unlike unsupervised schemes such as BM25.

This mail is intended to discuss the framework of Learning to Rank in Xapian
as a whole. I have thought of the following framework; please pour in your
insights or issues with it. It is also on the wiki for reference
[Link] <http://trac.xapian.org/wiki/GSoC2011/LTR/LTRFramework>.

First I will set out the structure and tentative elements [methods] of the
Xapian::Letor class, and then explain the whole flow of the 'Letor'
ranking.

Structure of the Xapian::Letor class

methods:

The following five methods prepare the statistical information needed
to generate the values of the desired features.

tf()

idf()

doc_len()

coll_tf()

coll_len()
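To make the intent of these five statistics concrete, here is a minimal sketch over a toy in-memory "index". In the real class these values would of course be read from the Xapian database; the `Doc` type and all implementations here are purely illustrative.

```cpp
#include <cmath>
#include <string>
#include <vector>

// Toy "index": each document is just a bag of terms (sketch only).
typedef std::vector<std::string> Doc;

// tf(): frequency of a term within one document.
int tf(const std::string &term, const Doc &doc) {
    int count = 0;
    for (size_t i = 0; i < doc.size(); ++i)
        if (doc[i] == term) ++count;
    return count;
}

// idf(): log(N / df), where df is the number of documents containing the term.
double idf(const std::string &term, const std::vector<Doc> &docs) {
    int df = 0;
    for (size_t i = 0; i < docs.size(); ++i)
        if (tf(term, docs[i]) > 0) ++df;
    if (df == 0) return 0.0;
    return std::log((double)docs.size() / df);
}

// doc_len(): length of a document in terms.
int doc_len(const Doc &doc) { return (int)doc.size(); }

// coll_tf(): total occurrences of the term across the whole collection.
int coll_tf(const std::string &term, const std::vector<Doc> &docs) {
    int total = 0;
    for (size_t i = 0; i < docs.size(); ++i) total += tf(term, docs[i]);
    return total;
}

// coll_len(): total length of the collection in terms.
int coll_len(const std::vector<Doc> &docs) {
    int total = 0;
    for (size_t i = 0; i < docs.size(); ++i) total += doc_len(docs[i]);
    return total;
}
```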

The following six methods will calculate the particular feature values with
the help of the above statistical information. Here the char ch tells the
method which part of the document the feature value is to be calculated for:
ch = 't' means calculate the feature for the title only, 'b' for the
body only, and 'w' for the whole document.

calculate_f1(... , char ch)

calculate_f2(... , char ch)

calculate_f3(... , char ch)

calculate_f4(... , char ch)

calculate_f5(... , char ch)

calculate_f6(... , char ch)
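As an illustration of how the ch parameter might steer one of these methods, here is a sketch of a calculate_f1()-style feature. The feature formula (sum over query terms of log(1 + tf)) and the `FieldedDoc` type are assumptions for the example, not the proposed implementation.

```cpp
#include <cmath>
#include <string>
#include <vector>

// Sketch only: a document split into a title part and a body part.
struct FieldedDoc {
    std::vector<std::string> title;
    std::vector<std::string> body;
};

static int count_in(const std::string &term,
                    const std::vector<std::string> &part) {
    int n = 0;
    for (size_t i = 0; i < part.size(); ++i)
        if (part[i] == term) ++n;
    return n;
}

// calculate_f1(): illustrative feature, sum over query terms of
// log(1 + tf), where ch selects the document part the term frequency
// is taken from: 't' = title only, 'b' = body only, 'w' = whole document.
double calculate_f1(const std::vector<std::string> &query,
                    const FieldedDoc &doc, char ch) {
    double f = 0.0;
    for (size_t i = 0; i < query.size(); ++i) {
        int tf = 0;
        if (ch == 't' || ch == 'w') tf += count_in(query[i], doc.title);
        if (ch == 'b' || ch == 'w') tf += count_in(query[i], doc.body);
        f += std::log(1.0 + tf);
    }
    return f;
}
```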

The following methods are very general; they use all of the methods above
in order to make a bridge between Xapian and the machine learning side.

void prepare_training_file() - This method prepares the training file which
is used by the Machine Learning (ML) algorithm - in our case SVM [a Support
Vector Machine tool] - to build the model. This method will be made public
because a user may want to prepare their own training file suited to their
application, supplying the necessary inputs: an index [dataset/corpus], a
query file, and the corresponding relevance judgements file. The method will
generate the training file as long as the input files are in a standard
format, which will be documented. We will also provide a training file and
model file with the standard distribution which are general enough for
direct use.
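For a sense of what one line of such a training file could look like, here is a sketch that formats a feature vector in the SVM-light style used by many LTR tools ("<relevance> qid:<query-id> 1:<f1> 2:<f2> ..."). Treat the exact layout as an assumption of this example rather than the final on-disk format.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: format one training line in the SVM-light style,
// "<relevance> qid:<query-id> 1:<f1> 2:<f2> ...". Feature indices
// are 1-based, as SVM tools conventionally expect.
std::string training_line(int relevance, int qid,
                          const std::vector<double> &features) {
    std::ostringstream out;
    out << relevance << " qid:" << qid;
    for (size_t i = 0; i < features.size(); ++i)
        out << ' ' << (i + 1) << ':' << features[i];
    return out.str();
}
```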

void learn_model() - This method trains the SVM model using the training
file, and writes the model to a model file which can then be used to rank
documents.

double learn_score() - Here we calculate the score of a document using the
model file and the generated feature vector.

Flow of the program: We will provide a quest-like utility, say questletor,
in which we first get the initial ranklist as an MSet. For each document we
then compute the feature values. This calculation may live inside the
learn_score() method, or there can be an independent method like
make_feature_vector() whose output is passed to learn_score(). learn_score()
returns the Letor score of the document, and a new ranklist based on the
Letor scores is returned.
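The re-ranking step of this flow can be sketched as follows: take the initial list of (docid, Letor score) pairs and re-sort it by score, descending. The `ScoredDoc` pair stands in for the real MSet entries and is an assumption of this example.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of the final questletor step: re-sort the initial ranklist
// by Letor score, highest first. ScoredDoc is (docid, letor score).
typedef std::pair<int, double> ScoredDoc;

static bool by_score_desc(const ScoredDoc &a, const ScoredDoc &b) {
    return a.second > b.second;
}

std::vector<ScoredDoc> rerank(std::vector<ScoredDoc> ranklist) {
    // stable_sort keeps the original (e.g. BM25) order for equal scores.
    std::stable_sort(ranklist.begin(), ranklist.end(), by_score_desc);
    return ranklist;
}
```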

Methods like prepare_training_file() and learn_model() only come into the
picture when a user wishes to use their own data or wants a different kind
of domain-dependent learning. I would like to mention that the training file
and model file distributed with the standard version are quite general for
the purpose, because the data used is Wikipedia, and the training file is
normalised at query level in order to make all variations of queries -
general queries, specific queries, etc. - comparable. Another point is that
only documents for which relevance judgements are available are considered
for the training file.

Extensibility of the structure: This framework can be extended in two ways -
with respect to features, and with respect to the Machine Learning (ML)
algorithm. For features, methods like calculate_f1() can be written for new
features and called from the place where the final vector is built. ML
algorithms demand that training file vectors and test vectors have the same
dimension, so if we want to add or remove features a new training file needs
to be generated; this is very straightforward, since we just have to add or
drop calls to calculate_f()-style methods. The training file format we have
selected is very standard and common to most of the LTR tools available; if
a tool demands the data in a different format, we can write multiple
training files, one in each required format.

Regards,
Parth Gupta,
http://sites.google.com/site/parthg88
