xapian-letor: FeatureVector discussion

Tue Jun 28 11:31:39 BST 2016

On Mon, Jun 27, 2016 at 07:19:15PM +0530, Ayush Tomar wrote:

> James might have something to say on the second approach. It wasn't
> discussed in detail and I don't completely understand how things will work
> here without having some sort of serialisation.

I suspect we should hash out the details of both to make it easier to
compare. I'm not happy with the current API around features, for
various reasons: it makes things into objects which aren't really
objects, and has all the different feature calculations done as
methods, when they're really distinct things.

I don't think we need to settle this ahead of getting the rest of the
code incorporated; getting this right is subtle work, and will take a
while.

The approach I was thinking would look something like this:

 * instead of Features, which is really a namespace implemented as a
   class, we separate out the calculation of the different features
   into distinct subclasses of Feature, whose only job is to calculate
   a single feature. Currently the FeatureManager calls these (via
   FeatureManager::Internal::transform) with the correct arguments,
   things like document statistics or tf or idf caches. This is
   analogous to how Weight objects can request various statistics, and
   the Enquire process then makes them available. So we can do it in a
   similar way (Feature declares that it needs tf and doclen, for
   instance, and FeatureManager can make sure they're available to the
   Feature when it's building a FeatureVector for a given document).

 * letor itself (during scoring) operates on FeatureVectors,
   representing Documents, and uses these to rerank an MSet; it does
   something similar during preparation of its training data. So the
   FeatureVector just needs to be calculated the same way in both
   situations.

 * when configuring the letor system either for training or for
   reranking, we construct a FeatureList(*) (which is basically a
   vector<Feature>), which we can later ask to generate a
   FeatureVector for a given document. (This splits some of the
   functionality of FeatureManager, but makes it clearer what each
   piece does.)

 * if you just construct a FeatureList, you'll get whatever the
   defaults should be. If you want to set your own features, you do
   that at construction time. That can include custom features, which
   wouldn't be possible under the enum model without editing
   xapian-letor and rebuilding it, which isn't friendly to
   developers.

 * Features becomes FeatureList, but with some functionality from
   FeatureManager. It's responsible for turning a Document into a
   FeatureVector, for the letor system to operate on.

 * Ranker should really be responsible for doing most of the work
   currently done by Letor. (Preparing training files, training the
   ranking algorithm &c.)

 * The rest of FeatureManager is really utilities (which can be
   functions in the Xapian::Letor namespace, or methods on whichever
   class makes sense). For instance, load_relevance() has nothing to
   do with features; it's part of the training stage. (It's also on
   FeatureVector, with effectively the same implementation.)

 * RankList is mostly a list of FeatureVectors, i.e. it's close to the
   thing we care about at the end. The final output we want is
   actually a ranked list of Documents, but this is almost the same
   thing.

(*) FeatureList isn't a great name, but I didn't want to adopt
anything too close to what we have, as it'd be confusing.
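
To make the shape of this more concrete, here is a rough sketch of
the Feature / FeatureList split described above. Every name in it
(StatFlag, DocStats, TfFeature, TfOverDoclenFeature, needed_stats(),
calculate(), generate()) is made up for illustration and is not the
current xapian-letor API; the two features are placeholders, not a
proposal for what the defaults should be.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// All names here are hypothetical; this is not the current xapian-letor API.

// Statistics a Feature can declare that it needs.
enum StatFlag : unsigned { NEED_TF = 1u, NEED_DOCLEN = 2u };

// What the surrounding machinery would make available for one document.
struct DocStats {
    std::map<std::string, double> tf;  // query term -> frequency in document
    double doclen = 0.0;
};

using FeatureVector = std::vector<double>;

// One subclass per feature; each calculates exactly one value.
class Feature {
  public:
    virtual ~Feature() = default;
    virtual unsigned needed_stats() const = 0;
    virtual double calculate(const DocStats& stats) const = 0;
};

// Placeholder feature: sum of query term frequencies in the document.
class TfFeature : public Feature {
  public:
    unsigned needed_stats() const override { return NEED_TF; }
    double calculate(const DocStats& stats) const override {
        double sum = 0.0;
        for (const auto& entry : stats.tf) sum += entry.second;
        return sum;
    }
};

// Placeholder feature: the same sum, normalised by document length.
class TfOverDoclenFeature : public Feature {
  public:
    unsigned needed_stats() const override { return NEED_TF | NEED_DOCLEN; }
    double calculate(const DocStats& stats) const override {
        if (stats.doclen == 0.0) return 0.0;
        return TfFeature().calculate(stats) / stats.doclen;
    }
};

// Turns the statistics for a document into a FeatureVector.
class FeatureList {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    // Default construction gives the default set of features.
    FeatureList() {
        features.push_back(std::unique_ptr<Feature>(new TfFeature));
        features.push_back(std::unique_ptr<Feature>(new TfOverDoclenFeature));
    }

    // Custom features are supplied at construction time.
    explicit FeatureList(std::vector<std::unique_ptr<Feature>> custom)
        : features(std::move(custom)) {
        for (const auto& f : features) assert(f != nullptr);
    }

    // Union of everything the features asked for, so the caller knows
    // which statistics to gather before calling generate().
    unsigned needed_stats() const {
        unsigned flags = 0;
        for (const auto& f : features) flags |= f->needed_stats();
        return flags;
    }

    // One value per feature, in the order the features were given.
    FeatureVector generate(const DocStats& stats) const {
        FeatureVector fv;
        fv.reserve(features.size());
        for (const auto& f : features) fv.push_back(f->calculate(stats));
        return fv;
    }
};
```

A user-defined feature is then just another subclass of Feature
passed to the second constructor, with no need to edit an enum and
rebuild xapian-letor.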

We shouldn't need serialisation of FeatureList or Feature, because
this stuff doesn't have to persist; it just has to be consistent
between training and reranking, which is an issue very similar to
getting prefixes right. Either a higher-level application holds the
configuration, or it lives in code shared between the
indexing/training and querying/reranking parts of the system.
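
As a standalone illustration of the "shared code" option, something
like the following would do. The names (Stats, FeatureFn,
shared_features(), vector_for_training(), vector_for_reranking())
are invented for this sketch and the two features are arbitrary; the
only point is that both stages get their feature set from the same
function, so nothing has to be written out and read back.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration, not xapian-letor types.
struct Stats {
    double tf;
    double doclen;
};
using FeatureFn = std::function<double(const Stats&)>;

// The single shared definition of which features are used, and in
// what order. Training and reranking both call this.
std::vector<FeatureFn> shared_features() {
    return {
        [](const Stats& s) { return s.tf; },
        [](const Stats& s) { return s.doclen > 0 ? s.tf / s.doclen : 0.0; },
    };
}

// Apply every shared feature to one document's statistics.
std::vector<double> to_vector(const Stats& s) {
    std::vector<double> out;
    for (const auto& f : shared_features()) out.push_back(f(s));
    return out;
}

// Both stages go through the same code path, so the vectors agree
// without any feature configuration being persisted.
std::vector<double> vector_for_training(const Stats& s) { return to_vector(s); }
std::vector<double> vector_for_reranking(const Stats& s) { return to_vector(s); }
```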

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org