xapian-letor: FeatureVector discussion

Sat Jul 2 08:34:06 BST 2016

Making Feature an abstract class is indeed a better approach than serialising
features using an enum; I understand it better now. All the other bits look
fine to me. One way to handle serialising the features when writing output
files is to use two separate files: i) the training file, in the standard
letor format with ids starting from 1 and increasing; and ii) a feature
properties file mapping those ids to the actual Feature sub-classes. This
might be less resource-intensive, and need fewer dependencies, than going to
a database.

Cheers
Parth

On Thu, Jun 30, 2016 at 12:17 AM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Wed, Jun 29, 2016 at 05:58:17PM +0530, Ayush Tomar wrote:
>
> > At present, letor is mostly centred around RankList (for both
> > training and ranking), whereas RankList is just a vector of
> > FeatureVectors corresponding to a qid. Having RankList in ranking
> > has no meaning since qid isn't required once the training part is
> > over. (letor_rank(*) method in letor_internal.cc supplies a junk
> > qid to the RankList while performing the ranking, which points out
> > that the RankList approach isn't quite correct).
>
> Ah, I'd missed or forgotten that last detail. Getting rid of RankList
> for the output is therefore probably a good idea; in that case, we
> should return another MSet. (Or something that looks and behaves like
> one.)
>
> > Hence, RankList can be completely eliminated and instead we can have
> > FeatureVector work on top of FeatureManager directly. Am I right?
>
> I think we'll end up with FeatureVector, Feature (and its subclasses),
> and possibly one other class which might be called FeatureManager but
> which would be different to the current one in its
> responsibilities. (This is why I gave it a different name of
> FeatureList for the time being. That's probably less confusing than
> calling it 'Jeff' ;-)
>
> > The score in FeatureVector is simply the label, and fvals will be
> > returned by FeatureManager (by using feature values obtained from
> > each of the Feature sub-class).
>
> Again, this is FeatureList not FeatureManager. (The thing that makes
> fvals, except that it'll actually just make a FeatureVector for that
> Document(*). During preparation this doesn't matter, but a more direct
> connection during re-ranking of the MSet should make it easier to
> return something like an MSet, with the same ease of access to the
> Document object again.)
>
> (*) in the context of the relevant Query
>
> > >  * Features becomes FeatureList, but with some functionality from
> > >    FeatureManager. It's responsible for turning a Document into a
> > >    FeatureVector, for the letor system to operate on.
> >
> > FeatureList can tell the vector<Features*> FeatureList object in
> > FeatureManager as to what Features sub-classes to initialize.
>
> Or it can just have a vector<Feature> (or Feature&) that it uses
> directly; the FeatureList constructor will either initialise this with
> a default set of Feature objects, or take an iterator over them or
> something (we could have `add_feature(Feature&)` too). (Note that
> `Features`, with an 's', is a utility namespace at the moment which we
> should try to get rid of.)
>
> > A vector<double> fval(*) function in FeatureManager can operate over
> > vector<Features*> FeatureList to return fvals to the
> > FeatureVector. Maybe your meaning of FeatureList is something
> > different. Can you please explain?
>
> I was thinking more like:
>
> Document doc; // we have one of these already
> FeatureList flist = FeatureList(); // default Feature choice
> FeatureVector fvec = flist.create_fvec(doc);
>
> So we don't make a FeatureVector and then poke things into it, we just
> return one that represents a particular Document. The responsibility
> for making a FeatureVector out of a Document lies with the thing that
> knows which Feature objects to use, and in which order: the FeatureList.
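
A minimal sketch of how that division of responsibility might look. This is
only an illustration under assumptions: Document here is a toy stand-in for
Xapian::Document, the two Feature sub-classes and the get_value() method name
are invented, and the query context is left out. Owning pointers are used
because a vector<Feature> cannot hold sub-class objects of an abstract class
by value.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for Xapian::Document, just enough for the sketch.
struct Document {
    std::string text;
};

// Abstract base: each sub-class computes one value for a document.
class Feature {
  public:
    virtual ~Feature() = default;
    virtual double get_value(const Document& doc) const = 0;
};

// Two invented features, for illustration only.
class LengthFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        return static_cast<double>(doc.text.size());
    }
};

class SpaceCountFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        double n = 0;
        for (char c : doc.text) {
            if (c == ' ') ++n;
        }
        return n;
    }
};

struct FeatureVector {
    std::vector<double> fvals;  // values in FeatureList order
};

// Knows which Feature objects to use and in which order.
class FeatureList {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    // Default Feature choice.
    FeatureList() {
        features.push_back(std::unique_ptr<Feature>(new LengthFeature()));
        features.push_back(std::unique_ptr<Feature>(new SpaceCountFeature()));
    }

    void add_feature(std::unique_ptr<Feature> f) {
        features.push_back(std::move(f));
    }

    // Build the FeatureVector that represents one Document.
    FeatureVector create_fvec(const Document& doc) const {
        FeatureVector fvec;
        fvec.fvals.reserve(features.size());
        for (const auto& f : features) {
            fvec.fvals.push_back(f->get_value(doc));
        }
        return fvec;
    }
};
```

With this shape, the caller never fills in a FeatureVector by hand: it
constructs a FeatureList once and calls create_fvec(doc) per document, so
training and re-ranking get values in the same order as long as they build
the FeatureList the same way.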
>
> > >  * Ranker should really be responsible for doing most of the work
> > >    currently done by Letor. (Preparing training files, training the
> > >    ranking algorithm &c.)
> >
> > Preparing training file is limited to FeatureVector calculation
> > only. Would there be a specific reason to include it in ranker?
>
> It feels to me slightly closer to that than anything
> else. Alternatively, it could just live in the Xapian::Letor namespace
> as a utility function.
>
> > > We shouldn't need serialisation of FeatureList or Feature, because
> > > this stuff doesn't have to persist, just be consistent, which is an
> > > issue very similar to getting prefixes right. Either a higher-level
> > > application has configuration, or it's in shared code between the
> > > indexing/training and querying/reranking parts of the system.
> >
> > I understand this now. It's the user's job to make it
> > consistent. Thanks for clarifying this.
>
> Note that if an application or library wants to serialise this
> configuration into the Xapian database, it can use Database metadata
> so it'll be carried around with the rest of the db. (Of course,
> chances are the trained data file for the SVM or whatever won't be
> stored in the same way, unless you also stuff them into metadata. The
> wisdom of doing this I don't know; Olly may have an opinion.)
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>
>