xapian-letor: FeatureVector discussion
Ayush Tomar
ayushtomar at gmail.com
Wed Jun 29 13:28:17 BST 2016
>
>
>
> The approach I was thinking would look something like this:
>
> * instead of Features, which is really a namespace implemented as a
> class, we separate out the calculation of the different features
> into distinct subclasses of Feature, whose only job is to calculate
> a single feature. Currently the FeatureManager calls these (via
> FeatureManager::Internal::transform) with the correct arguments,
> things like document statistics or tf or idf caches. This is
> analogous to how Weight objects can request various statistics, and
> the Enquire process then makes them available. So we can do it in a
> similar way (Feature declares that it needs tf and doclen, for
> instance, and FeatureManager can make sure they're available to the
> Feature when it's building a FeatureVector for a given document).
Yes. Features can get their own subdirectory, with each Feature subclass
having its own implementation. FeatureManager can do all the feature
handling for a query. It can hold a vector<Features*> FeatureList and
initialise each Feature subclass object named in the FeatureList (supplied
at training/ranking time, or taken from a default set).
At present, letor is mostly centred around RankList (for both training and
ranking), whereas RankList is just a vector of FeatureVectors corresponding
to a qid. Having RankList in ranking has no meaning since qid isn't
required once the training part is over. (letor_rank(*) method in
letor_internal.cc supplies a junk qid to the RankList while performing the
ranking, which points out that the RankList approach isn't quite correct).
Hence, RankList can be completely eliminated and instead we can have
FeatureVector work on top of FeatureManager directly. Am I right?
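
To make the idea concrete, here is a rough sketch of a Feature subclass that
declares which statistics it needs, and a FeatureManager that collects those
needs and builds the fvals. None of this is existing xapian-letor code: the
StatFlag enum, DocStats, TfDoclenFeature, required_stats() and fvals() are
all hypothetical names, and plain standard-library types stand in for the
Xapian classes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch only: these names are not the xapian-letor API.

// Statistics a Feature may ask the FeatureManager to provide.
enum StatFlag : unsigned { NEED_TF = 1, NEED_DOCLEN = 2, NEED_IDF = 4 };

// Per-document statistics; only the requested ones need to be filled in.
struct DocStats {
    std::map<std::string, double> tf;   // term -> frequency in the document
    std::map<std::string, double> idf;  // term -> inverse document frequency
    double doclen = 0.0;
};

class Feature {
  public:
    virtual ~Feature() {}
    // Which statistics this feature requires (a bitwise OR of StatFlag).
    virtual unsigned needs() const = 0;
    // Compute a single feature value for one document.
    virtual double value(const std::vector<std::string>& query,
                         const DocStats& stats) const = 0;
};

// Example feature: sum of query-term frequencies divided by document length.
class TfDoclenFeature : public Feature {
  public:
    unsigned needs() const override { return NEED_TF | NEED_DOCLEN; }
    double value(const std::vector<std::string>& query,
                 const DocStats& stats) const override {
        if (stats.doclen <= 0.0) return 0.0;
        double sum = 0.0;
        for (const auto& term : query) {
            auto it = stats.tf.find(term);
            if (it != stats.tf.end()) sum += it->second;
        }
        return sum / stats.doclen;
    }
};

class FeatureManager {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    void add(std::unique_ptr<Feature> f) {
        assert(f && "feature must not be null");
        features.push_back(std::move(f));
    }
    // Union of the statistics requested by all registered features.
    unsigned required_stats() const {
        unsigned flags = 0;
        for (const auto& f : features) flags |= f->needs();
        return flags;
    }
    // One value per registered feature, in registration order.
    std::vector<double> fvals(const std::vector<std::string>& query,
                              const DocStats& stats) const {
        std::vector<double> out;
        out.reserve(features.size());
        for (const auto& f : features) out.push_back(f->value(query, stats));
        return out;
    }
};
```

The caller would check required_stats() once per query, gather only those
statistics for each document, and then call fvals() to fill a FeatureVector.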
>
> * letor itself (during scoring) operates on FeatureVectors,
> representing Documents, and uses this to rerank an MSet; it does
> something similar during preparation of its training data. So how
> the FeatureVector is calculated just needs to be done the same
> in both situations.
>
Yes. At present Xapian::RankList create_rank_list(const Xapian::MSet &
mset, std::string & qid, bool train) defined in FeatureManager does the job
of preparing the FeatureVector for a query. When preparing the
training file, FeatureVector calculation can be done while parsing the
query and qrel files, independently of RankList. That removes the
need to maintain a global qrel store (map<string, map<string, int> >
qrel; in FeatureManager::Internal), and with it the need for the
load_relevance(*) and getlabel(*) functions. The score in FeatureVector is
simply the label, and fvals will be returned by FeatureManager (using the
feature values obtained from each Feature subclass).
While ranking, FeatureVector fvals will be computed similarly by
FeatureManager, while the score gets assigned later at the time of ranking.
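
A small sketch of the two paths, assuming a simplified FeatureVector with
just did, fvals and score. The helper names (make_training_vector,
make_ranking_vector, assign_scores) and the linear scoring are made up for
illustration; a real Ranker would supply its own scoring.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the letor FeatureVector.
struct FeatureVector {
    std::string did;            // document identifier
    std::vector<double> fvals;  // feature values from the FeatureManager
    double score = 0.0;         // label when training, ranker score when ranking
};

// Training: the label read from the current qrel line becomes the score,
// so no global qrel map has to be kept around.
FeatureVector make_training_vector(const std::string& did, int label,
                                   std::vector<double> fvals) {
    FeatureVector fv;
    fv.did = did;
    fv.fvals = std::move(fvals);
    fv.score = label;
    return fv;
}

// Ranking: fvals are computed the same way, but the score is left at 0
// until the ranker assigns it.
FeatureVector make_ranking_vector(const std::string& did,
                                  std::vector<double> fvals) {
    FeatureVector fv;
    fv.did = did;
    fv.fvals = std::move(fvals);
    return fv;
}

// Stand-in for a trained ranker: a plain weighted sum of the feature values.
void assign_scores(std::vector<FeatureVector>& fvs,
                   const std::vector<double>& weights) {
    for (auto& fv : fvs) {
        assert(fv.fvals.size() == weights.size());
        double s = 0.0;
        for (std::size_t i = 0; i < weights.size(); ++i)
            s += weights[i] * fv.fvals[i];
        fv.score = s;
    }
}
```

Neither path needs a qid inside the vector itself, which is the point about
RankList being unnecessary at ranking time.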
> * when configuring the letor system either for training or for
> reranking, we construct a FeatureList(*) (which is basically a
> vector<Feature>), which we can later ask to generate a
> FeatureVector for a given document. (This splits some of the
> functionality of FeatureManager, but makes it more clear what each
> piece does.)
> * if you just construct a FeatureList, you'll get whatever the
> defaults should be. If you want to set your own features, you do
> that at construction time. That can include custom features, which
> wouldn't be possible under the enum model without editing
> xapian-letor and rebuilding it, which isn't friendly to
> developers.
>
> * Features becomes FeatureList, but with some functionality from
> FeatureManager. It's responsible for turning a Document into a
> FeatureVector, for the letor system to operate on.
>
FeatureList can tell the vector<Features*> FeatureList object in
FeatureManager which Feature subclasses to initialise. A
vector<double> fval(*) function in FeatureManager can then iterate over
vector<Features*> FeatureList and return the fvals to the FeatureVector. Maybe
your meaning of FeatureList is something different. Can you please explain?
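
For comparison, this is how I read the quoted proposal, as a sketch only: a
FeatureList that owns the Feature objects, falls back to defaults when
constructed without arguments, accepts custom features at construction time,
and turns a document into fvals. Doc, DoclenFeature, TextLengthFeature and
generate() are hypothetical names, and Doc is a stand-in for
Xapian::Document.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document and its statistics.
struct Doc {
    std::string text;
    double doclen;
};

class Feature {
  public:
    virtual ~Feature() {}
    virtual double value(const Doc& doc) const = 0;
};

// A feature that could be part of the default set.
class DoclenFeature : public Feature {
  public:
    double value(const Doc& doc) const override { return doc.doclen; }
};

// A user-defined feature, added without editing or rebuilding the library.
class TextLengthFeature : public Feature {
  public:
    double value(const Doc& doc) const override {
        return static_cast<double>(doc.text.size());
    }
};

class FeatureList {
    std::vector<std::shared_ptr<Feature>> features;

  public:
    // No arguments: use whatever the defaults should be.
    FeatureList() { features.push_back(std::make_shared<DoclenFeature>()); }

    // Custom set of features, fixed at construction time.
    explicit FeatureList(std::vector<std::shared_ptr<Feature>> custom)
        : features(std::move(custom)) {
        for (const auto& f : features) assert(f && "null feature");
    }

    // Turn one document into its feature values, in list order.
    std::vector<double> generate(const Doc& doc) const {
        std::vector<double> fvals;
        fvals.reserve(features.size());
        for (const auto& f : features) fvals.push_back(f->value(doc));
        return fvals;
    }
};
```

Under this reading the same FeatureList has to be constructed for both
training and reranking so that the fvals line up.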
>
> * Ranker should really be responsible for doing most of the work
> currently done by Letor. (Preparing training files, training the
> ranking algorithm &c.)
>
Preparing the training file is limited to FeatureVector calculation only. Is
there a specific reason to include it in Ranker?
>
> * The rest of FeatureManager is really utilities (which can be
> functions in the Xapian::Letor namespace, or methods on whichever
> class makes sense). For instance, load_relevance() has nothing to
> do with features; it's part of the training stage. (It's also on
> FeatureVector, with effectively the same implementation.)
>
Yes. These functions are not defined in the right places. If we decide to
eliminate RankList, most of these methods will have no meaning and
can therefore be removed.
>
> * RankList is mostly a list of FeatureVectors, ie it's close to the
> thing we care about at the end. The final output we want is
> actually a ranked list of Documents, but this is almost the same
> thing.
>
> (*) FeatureList isn't a great name, but I didn't want to adopt
> anything too close to what we have, as it'd be confusing.
>
As I have asked above, I think there is a misunderstanding between my
interpretation of FeatureList and yours. Please correct me where I am wrong.
>
> We shouldn't need serialisation of FeatureList or Feature, because
> this stuff doesn't have to persist, just be consistent, which is an
> issue very similar to getting prefixes right. Either a higher-level
> application has configuration, or it's in shared code between the
> indexing/training and querying/reranking parts of the system.
>
I understand this now. It's the user's job to make it consistent. Thanks for
clarifying this.
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
>
>
--
----------------------------------------------------------------------------
Kind Regards,
Ayush Tomar | My Webpage <http://ayshtmr.xyz> | LinkedIn
<https://in.linkedin.com/in/ayushtomar>