xapian-letor: FeatureVector discussion

Wed Jun 29 19:47:52 BST 2016

On Wed, Jun 29, 2016 at 05:58:17PM +0530, Ayush Tomar wrote:

> At present, letor is mostly centred around RankList (for both training and
> ranking), whereas RankList is just a vector of FeatureVectors corresponding
> to a qid. Having RankList in ranking has no meaning since qid isn't
> required once the training part is over. (letor_rank(*) method in
> letor_internal.cc supplies a junk qid to the RankList while performing the
> ranking, which points out that the RankList approach isn't quite correct).

Ah, I'd missed or forgotten that last detail. Getting rid of RankList
for the output is therefore probably a good idea; in that case, we
should return another MSet. (Or something that looks and behaves like
one.)

> Hence, RankList can be completely eliminated and instead we can have
> FeatureVector work on top of FeatureManager directly. Am I right?

I think we'll end up with FeatureVector, Feature (and its subclasses),
and possibly one other class which might be called FeatureManager but
which would be different to the current one in its
responsibilities. (This is why I gave it a different name of
FeatureList for the time being. That's probably less confusing than
calling it 'Jeff' ;-)

> The score in FeatureVector is simply the label, and fvals will be
> returned by FeatureManager (by using feature values obtained from
> each of the Feature sub-class).

Again, this is FeatureList, not FeatureManager. (It's the thing that
makes fvals, except that it'll actually just make a FeatureVector for
that Document(*). During preparation this doesn't matter, but a more
direct connection during re-ranking of the MSet should make it easier
to return something like an MSet, with the same ease of access to the
Document object again.)

(*) in the context of the relevant Query
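
To illustrate what "something like an MSet" could mean for re-ranking,
here's a rough sketch. Everything in it is made up for illustration:
Document is a toy struct rather than Xapian::Document, and RankedItem
and rerank() are hypothetical names, not existing letor API. The point
is only that each result keeps its Document next to its new score.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for Xapian::Document (assumption, for illustration only).
struct Document {
    std::string data;
};

// One re-ranked result: the Document stays attached to its new score.
struct RankedItem {
    Document doc;
    double score;
};

// Hypothetical helper: score every document, then order by score.
std::vector<RankedItem>
rerank(const std::vector<Document>& docs,
       const std::function<double(const Document&)>& score_fn)
{
    // An empty std::function would throw when called, so check up front.
    assert(static_cast<bool>(score_fn));

    std::vector<RankedItem> out;
    out.reserve(docs.size());
    for (const Document& d : docs) {
        out.push_back(RankedItem{d, score_fn(d)});
    }
    // Highest score first; stable_sort keeps the input order for ties.
    std::stable_sort(out.begin(), out.end(),
                     [](const RankedItem& a, const RankedItem& b) {
                         return a.score > b.score;
                     });
    return out;
}
```

In a real version score_fn would be the trained ranker applied to the
FeatureVector for each Document; here it is left as a plain callback.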

> >  * Features becomes FeatureList, but with some functionality from
> >    FeatureManager. It's responsible for turning a Document into a
> >    FeatureVector, for the letor system to operate on.
> 
> FeatureList can tell the vector<Features*> FeatureList object in
> FeatureManager as to what Features sub-classes to initialize.

Or it can just have a vector<Feature> (or Feature&) that it uses
directly; the FeatureList constructor will either initialise this with
a default set of Feature objects, or take an iterator over them or
something (we could have `add_feature(Feature&)` too). (Note that
`Features`, with an 's', is a utility namespace at the moment which we
should try to get rid of.)

> A vector<double> fval(*) function in FeatureManager can operate over
> vector<Features*> FeatureList to return fvals to the
> FeatureVector. Maybe your meaning of FeatureList is something
> different. Can you please explain?

I was thinking more like:

    Document doc;                   // we have one of these already
    FeatureList flist;              // default Feature choice
    FeatureVector fvec = flist.create_fvec(doc);

So we don't make a FeatureVector and then poke things into it, we just
return one that represents a particular Document. The responsibility
for making a FeatureVector out of a Document belongs to the thing that
knows which Feature objects to use, and in which order: the FeatureList.
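
A slightly fuller sketch of that split of responsibilities follows. It
is only a sketch under assumptions: Document is a toy struct standing
in for Xapian::Document, the two Feature subclasses are invented, and
the get_value(), add_feature() and create_fvec() signatures are
guesses at a shape, not the final API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for Xapian::Document (assumption, for illustration only).
struct Document {
    std::string title;
    std::string body;
};

// One Feature maps a Document to a single value.
class Feature {
  public:
    virtual ~Feature() = default;
    virtual double get_value(const Document& doc) const = 0;
};

// Invented features; real ones would look at term statistics.
class TitleLengthFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        return static_cast<double>(doc.title.size());
    }
};

class BodyLengthFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        return static_cast<double>(doc.body.size());
    }
};

// The feature values for one Document, plus a slot for label/score.
struct FeatureVector {
    std::vector<double> fvals;
    double score = 0.0;
};

// Knows which Feature objects to use, and in which order.
class FeatureList {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    // Default Feature choice.
    FeatureList() {
        features.push_back(std::unique_ptr<Feature>(new TitleLengthFeature));
        features.push_back(std::unique_ptr<Feature>(new BodyLengthFeature));
    }

    // Append another Feature; its value goes last in each FeatureVector.
    void add_feature(std::unique_ptr<Feature> f) {
        assert(f != nullptr);
        features.push_back(std::move(f));
    }

    // Build the FeatureVector that represents doc.
    FeatureVector create_fvec(const Document& doc) const {
        FeatureVector fvec;
        fvec.fvals.reserve(features.size());
        for (const auto& f : features) {
            fvec.fvals.push_back(f->get_value(doc));
        }
        return fvec;
    }
};
```

The sketch holds the features through std::unique_ptr because a
std::vector of Feature by value would slice the subclasses, and a
std::vector of references isn't allowed in C++. Who owns the Feature
objects is a separate design question that this doesn't try to settle.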

> >  * Ranker should really be responsible for doing most of the work
> >    currently done by Letor. (Preparing training files, training the
> >    ranking algorithm &c.)
> 
> Preparing training file is limited to FeatureVector calculation
> only. Would there be a specific reason to include it in ranker?

It feels to me slightly closer to that than anything
else. Alternatively, it could just live in the Xapian::Letor namespace
as a utility function.

> > We shouldn't need serialisation of FeatureList or Feature, because
> > this stuff doesn't have to persist, just be consistent, which is an
> > issue very similar to getting prefixes right. Either a higher-level
> > application has configuration, or it's in shared code between the
> > indexing/training and querying/reranking parts of the system.
> 
> I understand this now. It's the user's job to make it
> consistent. Thanks for clarifying this.

Note that if an application or library wants to serialise this
configuration into the Xapian database, it can use Database metadata
so it'll be carried around with the rest of the db. (Of course,
chances are the trained data file for the SVM or whatever won't be
stored in the same way, unless you also stuff it into metadata. The
wisdom of doing this I don't know; Olly may have an opinion.)
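
As a rough idea of what serialising the configuration might look like:
the sketch below turns an ordered list of feature names into a single
string and back. The function names, the comma-separated format and
the "letor.features" key are all assumptions for illustration. With a
real database the string would be stored with
Xapian::WritableDatabase::set_metadata() and read back with
Xapian::Database::get_metadata(); a std::map stands in for that here
so the example has no dependency on Xapian.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Join feature names in order, e.g. {"tf", "idf"} becomes "tf,idf".
std::string serialise_features(const std::vector<std::string>& names) {
    std::string out;
    for (std::size_t i = 0; i < names.size(); ++i) {
        // Names must not contain the separator, or the round trip breaks.
        assert(names[i].find(',') == std::string::npos);
        if (i != 0) out += ',';
        out += names[i];
    }
    return out;
}

// Split the stored string back into names, preserving their order.
std::vector<std::string> unserialise_features(const std::string& s) {
    std::vector<std::string> names;
    std::istringstream in(s);
    std::string name;
    while (std::getline(in, name, ',')) {
        names.push_back(name);
    }
    return names;
}

// Stand-in for the database's key/value metadata store (assumption).
using FakeMetadata = std::map<std::string, std::string>;
```

Order matters as much as membership here, since the position of each
value in the FeatureVector has to match between training and
re-ranking.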

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org