xapian-letor: FeatureVector discussion
James Aylett
james-xapian at tartarus.org
Tue Jun 28 11:31:39 BST 2016
On Mon, Jun 27, 2016 at 07:19:15PM +0530, Ayush Tomar wrote:
> James might have something to say on the second approach. It wasn't
> discussed in detail and I don't completely understand how things will work
> here without having some sort of serialisation.
I suspect we should hash out the details of both to make it easier to
compare. I'm not happy with the current API around features, for
various reasons: it makes things into objects which aren't really
objects, and it implements all the different feature calculations as
methods, when they're really distinct things.
I don't think we need to settle this ahead of getting the rest of the
code incorporated; getting this right is subtle work, and will take a
while.
The approach I was thinking would look something like this:
* instead of Features, which is really a namespace implemented as a
class, we separate out the calculation of the different features
into distinct subclasses of Feature, whose only job is to calculate
a single feature. Currently the FeatureManager calls these (via
FeatureManager::Internal::transform) with the correct arguments,
things like document statistics or tf or idf caches. This is
analogous to how Weight objects can request various statistics, and
the Enquire process then makes them available. So we can do it in a
similar way (Feature declares that it needs tf and doclen, for
instance, and FeatureManager can make sure they're available to the
Feature when it's building a FeatureVector for a given document).
* letor itself (during scoring) operates on FeatureVectors,
representing Documents, and uses these to rerank an MSet; it does
something similar when preparing its training data. So the
FeatureVector just needs to be calculated the same way in both
situations.
* when configuring the letor system either for training or for
reranking, we construct a FeatureList(*) (which is basically a
vector<Feature>), which we can later ask to generate a
FeatureVector for a given document. (This splits some of the
functionality of FeatureManager, but makes it clearer what each
piece does.)
* if you just construct a FeatureList, you'll get whatever the
defaults should be. If you want to set your own features, you do
that at construction time. That can include custom features, which
wouldn't be possible under the enum model without editing
xapian-letor and rebuilding it, which isn't friendly to
developers.
* Features becomes FeatureList, but with some functionality from
FeatureManager. It's responsible for turning a Document into a
FeatureVector, for the letor system to operate on.
* Ranker should really be responsible for doing most of the work
currently done by Letor. (Preparing training files, training the
ranking algorithm &c.)
* The rest of FeatureManager is really utilities (which can be
functions in the Xapian::Letor namespace, or methods on whichever
class makes sense). For instance, load_relevance() has nothing to
do with features; it's part of the training stage. (It's also on
FeatureVector, with effectively the same implementation.)
* RankList is mostly a list of FeatureVectors, ie it's close to the
thing we care about at the end. The final output we want is
actually a ranked list of Documents, but this is almost the same
thing.
(*) FeatureList isn't a great name, but I didn't want to adopt
anything too close to what we have, as it'd be confusing.
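To make the shape of this concrete, here is a minimal sketch of the Feature / FeatureList split described above. Everything in it is hypothetical: StatFlag, DocStats, TfFeature, NormalisedTfFeature, stats_needed(), calculate() and generate() are names invented for illustration, not existing xapian-letor API, and DocStats stands in for whatever statistics FeatureList would really gather from the database. It also holds the features as owning pointers rather than a literal vector<Feature>, since the features are subclasses.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; none of this is existing xapian-letor API.

// Statistics a Feature can ask for, as bit flags (loosely analogous to how
// Weight subclasses declare the statistics they need).
enum StatFlag : unsigned { TERM_FREQ = 1u << 0, DOC_LENGTH = 1u << 1 };

// Stand-in for the per-document statistics made available to features.
struct DocStats {
    std::map<std::string, unsigned> tf;  // within-document term frequencies
    unsigned doclen = 0;                 // document length
};

using FeatureVector = std::vector<double>;

// A Feature's only job is to calculate a single feature value.
class Feature {
  public:
    virtual ~Feature() = default;
    // Declares which statistics this feature needs.
    virtual unsigned stats_needed() const = 0;
    virtual double calculate(const std::vector<std::string>& query,
                             const DocStats& stats) const = 0;
};

// Sum of the term frequencies of the query terms in the document.
class TfFeature : public Feature {
  public:
    unsigned stats_needed() const override { return TERM_FREQ; }
    double calculate(const std::vector<std::string>& query,
                     const DocStats& stats) const override {
        double sum = 0.0;
        for (const auto& term : query) {
            auto it = stats.tf.find(term);
            if (it != stats.tf.end()) sum += it->second;
        }
        return sum;
    }
};

// A user-supplied feature: the same sum, normalised by document length.
class NormalisedTfFeature : public Feature {
  public:
    unsigned stats_needed() const override { return TERM_FREQ | DOC_LENGTH; }
    double calculate(const std::vector<std::string>& query,
                     const DocStats& stats) const override {
        if (stats.doclen == 0) return 0.0;
        return TfFeature().calculate(query, stats) / stats.doclen;
    }
};

// Turns the statistics for one document into a FeatureVector.
class FeatureList {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    // Default construction gives a (here, trivial) default feature set.
    FeatureList() { features.push_back(std::make_unique<TfFeature>()); }

    // Custom features are supplied at construction time.
    explicit FeatureList(std::vector<std::unique_ptr<Feature>> f)
        : features(std::move(f)) {}

    // Union of everything the features asked for, so the caller knows
    // which statistics to gather before calling generate().
    unsigned stats_needed() const {
        unsigned flags = 0;
        for (const auto& f : features) flags |= f->stats_needed();
        return flags;
    }

    // One value per feature, in construction order.
    FeatureVector generate(const std::vector<std::string>& query,
                           const DocStats& stats) const {
        FeatureVector fv;
        fv.reserve(features.size());
        for (const auto& f : features) fv.push_back(f->calculate(query, stats));
        return fv;
    }
};
```

The point of the sketch is that NormalisedTfFeature can live entirely in application code; nothing in the library needs to know about it, which is what the enum model can't offer.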
We shouldn't need serialisation of FeatureList or Feature, because
this stuff doesn't have to persist, just be consistent, which is an
issue very similar to getting prefixes right. Either a higher-level
application has configuration, or it's in shared code between the
indexing/training and querying/reranking parts of the system.
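As a rough illustration of the "shared code" option, the sketch below has one function build the feature list, and both the training and the reranking side call it, so the feature order can't drift between the two. All names here (Doc, make_feature_list, to_vector, prepare_training, score) are hypothetical, features are reduced to plain callables to keep it short, and the linear scoring is only a placeholder for a real ranker.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins, not xapian-letor API.
struct Doc {
    unsigned doclen;          // document length
    unsigned matching_terms;  // number of query terms found in the document
};

using FeatureVector = std::vector<double>;
using Feature = std::function<double(const Doc&)>;
using FeatureList = std::vector<Feature>;

// The single shared definition of which features exist and in what order.
// Nothing is serialised; consistency comes from both sides calling this.
FeatureList make_feature_list() {
    return {
        [](const Doc& d) { return double(d.matching_terms); },
        [](const Doc& d) {
            return d.doclen ? double(d.matching_terms) / d.doclen : 0.0;
        },
    };
}

FeatureVector to_vector(const FeatureList& fl, const Doc& d) {
    FeatureVector fv;
    fv.reserve(fl.size());
    for (const auto& f : fl) fv.push_back(f(d));
    return fv;
}

// Training side: one FeatureVector per training document.
std::vector<FeatureVector> prepare_training(const std::vector<Doc>& docs) {
    const FeatureList fl = make_feature_list();
    std::vector<FeatureVector> out;
    for (const auto& d : docs) out.push_back(to_vector(fl, d));
    return out;
}

// Reranking side: placeholder linear model over the same vectors.
std::vector<double> score(const std::vector<Doc>& docs,
                          const std::vector<double>& weights) {
    const FeatureList fl = make_feature_list();
    std::vector<double> scores;
    for (const auto& d : docs) {
        const FeatureVector fv = to_vector(fl, d);
        double s = 0.0;
        const std::size_t n = std::min(fv.size(), weights.size());
        for (std::size_t i = 0; i < n; ++i) s += weights[i] * fv[i];
        scores.push_back(s);
    }
    return scores;
}
```

If the two sides live in separate programs, the equivalent is for both to read the same application configuration and build the list from that, much as both indexer and searcher have to agree on term prefixes.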
J
--
James Aylett, occasional trouble-maker
xapian.org