[Xapian-devel] Discuss a few things about already implemented methods in Rishabh's branch

Sun Mar 9 11:10:22 GMT 2014

Hi Mayank,

> Before getting started to work on svmranker.cc, I need to discuss a few
> things.
>

Yes, it is a good idea to understand the framework before you actually
start writing code.

> For *featurevector.h* -
>
> 1. I think it is a header file for the data structure used for storing a
> query relevance, though the file itself says "This file responsible for
> transforming the document into the feature space"
> <https://github.com/rishabhmehrotra/xapian/blob/master/xapian-letor/featurevector.h#L1>.
> Also, all the methods there are *get* and *set* except *load_relevance*().
> This same method is also present in featuremanager.h
> <https://github.com/rishabhmehrotra/xapian/blob/master/xapian-letor/featuremanager.h#L55>.
> The implementations are the same too. I can't find the reason why the
> same method is present in two headers.
> http://trac.xapian.org/wiki/GSoC2012/LTR/TODO also shows that there
> shouldn't be load_relevance() method in featurevector.h .
>

Some redundancy is to be expected: the code has not been cleaned up and,
unfortunately, the project was never finished. Yes, load_relevance()
belongs more naturally in featuremanager than in featurevector.

Basically, the featurevector operates at the document level and the
ranklist operates at the query level. One query has many documents related
to it, so the values that are common to all of those documents live in
ranklist, and the information that pertains to a single document lives in
featurevector. Featuremanager does most of the work of constructing each
featurevector and fetching the statistics it needs.
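
A minimal sketch of that split, to make it concrete. The struct and member
names below (DocFeatures, QueryRankList, did, qid, fvals) are hypothetical
and chosen for illustration; they are not copied from the branch, and the
values are made up.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-document record (illustrative names, not the branch's API).
struct DocFeatures {
    std::string did;                // document identifier
    double label;                   // relevance label for this (query, document) pair
    std::map<int, double> fvals;    // feature index -> feature value
};

// Hypothetical per-query container: the query id is stored once here,
// not repeated in every document's feature vector.
struct QueryRankList {
    std::string qid;
    std::vector<DocFeatures> docs;

    void add(DocFeatures doc) { docs.push_back(std::move(doc)); }
};

// Builds a tiny two-document list for one query, using made-up values.
QueryRankList example_ranklist() {
    QueryRankList rl;
    rl.qid = "q1";
    rl.add(DocFeatures{"d1", 2.0, {{1, 0.5}, {2, 0.1}}});
    rl.add(DocFeatures{"d2", 0.0, {{1, 0.2}, {2, 0.7}}});
    return rl;
}
```

With this layout the query id is a property of the list, which is why a
per-document structure does not need to carry it.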

> 2. As it was mentioned in a mail by Jiarong Wei, the data member *label*
> should be of type *bool* rather than *double*. The data member *fcount*
> is also unused.
>

As I just answered him, many Letor datasets have more than two relevance
levels (Letor 3.0 and 4.0 have three relevance levels, the Yahoo! Letor
dataset has 5), so a bool is not enough. The other reason for keeping it a
double is that when the ranking algorithm assigns a real-valued relevance
score to the feature vector, it can be stored in the same place. The
evaluationmetric should then sort the documents based on this number.
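
As a rough illustration of why a double works for both cases, here is a
small sketch that sorts documents by that number, highest first. ScoredDoc
and sort_by_label are hypothetical names for this example only, not
functions from the branch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record: the same double holds either a graded relevance
// level (e.g. 0, 1, 2) or a real-valued score assigned by a ranker.
struct ScoredDoc {
    std::string did;
    double label;
};

// Returns the documents ordered by label, highest first.
// std::stable_sort keeps documents with equal labels in their input order.
std::vector<ScoredDoc> sort_by_label(std::vector<ScoredDoc> docs) {
    std::stable_sort(docs.begin(), docs.end(),
                     [](const ScoredDoc& a, const ScoredDoc& b) {
                         return a.label > b.label;
                     });
    return docs;
}
```

An evaluation metric could take the sorted list and compare its order with
the judged relevance levels.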

Yes, 'fcount' should be used; that is a TODO.

>
> 3. As it is a feature vector, there should be a data member *queryid*,
> but I found out that it is in ranklist.h
> <https://github.com/rishabhmehrotra/xapian/blob/master/xapian-letor/ranklist.h#L50>.
>

Just see the explanation to point 1.

>
> Other than that, I wanted to know: have ListMLE and ListNet been tested?
> And what is autoencoder.cc for and where is the "dimred/ya_ate_dimred.h"
> header that has been included in it?
>

ListMLE and ListNet have not been tested, and Rishabh did not mention
their performance either. We only have the benchmark evaluation of
svmranker. You can ignore autoencoder.cc: it was part of Rishabh's idea to
add unsupervised features learned with deep learning to the feature vector,
in addition to the conventional features.

Cheers,
Parth.

>
> -Mayank