<div dir="ltr"><div class="gmail_default" style="color:rgb(11,83,148)">Making Feature an abstract class is indeed a better approach than serialising features using an enum. I understand it better now. All the other bits look fine to me. One way to handle serialising the features into output files is to use two separate files: i) the training file, in the standard letor format, with feature ids starting from 1 and increasing; and ii) a feature properties file that maps each id to the actual Feature sub-class. This might be less resource-intensive, and need fewer dependencies, than going to a database.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">Cheers<br></div><div class="gmail_default" style="color:rgb(11,83,148)">Parth<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jun 30, 2016 at 12:17 AM, James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Jun 29, 2016 at 05:58:17PM +0530, Ayush Tomar wrote:<br>
<br>
> At present, letor is mostly centred around RankList (for both training and<br>
> ranking), whereas RankList is just a vector of FeatureVectors corresponding<br>
> to a qid. Having RankList in ranking has no meaning since qid isn't<br>
> required once the training part is over. (letor_rank(*) method in<br>
> letor_internal.cc supplies a junk qid to the RankList while performing the<br>
> ranking, which points out that the RankList approach isn't quite correct).<br>
<br>
</span>Ah, I'd missed or forgotten that last detail. Getting rid of RankList<br>
for the output is therefore probably a good idea; in that case, we<br>
should return another MSet. (Or something that looks and behaves like<br>
one.)<br>
<span class=""><br>
> Hence, RankList can be completely eliminated and instead we can have<br>
> FeatureVector work on top of FeatureManager directly. Am I right?<br>
<br>
</span>I think we'll end up with FeatureVector, Feature (and its subclasses),<br>
and possibly one other class which might be called FeatureManager but<br>
which would be different to the current one in its<br>
responsibilities. (This is why I gave it a different name of<br>
FeatureList for the time being. That's probably less confusing than<br>
calling it 'Jeff' ;-)<br>
<span class=""><br>
> The score in FeatureVector is simply the label, and fvals will be<br>
> returned by FeatureManager (by using feature values obtained from<br>
> each of the Feature sub-class).<br>
<br>
</span>Again, this is FeatureList not FeatureManager. (The thing that makes<br>
fvals, except that it'll actually just make a FeatureVector for that<br>
Document(*). During preparation this doesn't matter, but a more direct<br>
connection during re-ranking of the MSet should make it easier to<br>
return something like an MSet, with the same ease of access to the<br>
Document object again.)<br>
<br>
(*) in the context of the relevant Query<br>
<span class=""><br>
> > * Features becomes FeatureList, but with some functionality from<br>
> > FeatureManager. It's responsible for turning a Document into a<br>
> > FeatureVector, for the letor system to operate on.<br>
><br>
> FeatureList can tell the vector<Features*> FeatureList object in<br>
> FeatureManager as to what Features sub-classes to initialize.<br>
<br>
</span>Or it can just have a vector<Feature> (or Feature&) that it uses<br>
directly; the FeatureList constructor will either initialise this with<br>
a default set of Feature objects, or take an iterator over them or<br>
something (we could have `add_feature(Feature&)` too). (Note that<br>
`Features`, with an 's', is a utility namespace at the moment which we<br>
should try to get rid of.)<br>
<span class=""><br>
> A vector<double> fval(*) function in FeatureManager can operate over<br>
> vector<Features*> FeatureList to return fvals to the<br>
> FeatureVector. Maybe your meaning of FeatureList is something<br>
> different. Can you please explain?<br>
<br>
</span>I was thinking more like:<br>
<br>
Document doc; // we have one of these already<br>
FeatureList flist = FeatureList(); // default Feature choice<br>
FeatureVector fvec = flist.create_fvec(doc);<br>
<br>
So we don't make a FeatureVector and then poke things into it, we just<br>
return one that represents a particular Document. The responsibility<br>
for making a FeatureVector out of a Document lies with the thing that<br>
knows which Feature objects to use, and in which order: the FeatureList.<br>
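<br>
To make that shape concrete, here is a minimal, self-contained sketch of how a FeatureList might own its Feature objects and build a FeatureVector from a Document. Everything in it beyond the names FeatureList, Feature, FeatureVector, create_fvec and add_feature is an assumption for illustration: the Document struct is a stand-in for the real Xapian one, the two Feature sub-classes and get_value() are hypothetical, and add_feature takes a std::unique_ptr here (rather than the Feature& suggested above) because an abstract class cannot be stored by value in a vector.<br>
<br>
```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real Document class, so the sketch
// compiles on its own.
struct Document {
    std::string text;
};

// Abstract base class: each sub-class computes one feature value.
class Feature {
  public:
    virtual ~Feature() = default;
    // get_value() is an assumed name, not an existing API.
    virtual double get_value(const Document& doc) const = 0;
};

// Two toy features, purely illustrative.
class LengthFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        return static_cast<double>(doc.text.size());
    }
};

class SpaceCountFeature : public Feature {
  public:
    double get_value(const Document& doc) const override {
        double n = 0;
        for (char c : doc.text) {
            if (c == ' ') ++n;
        }
        return n;
    }
};

// Only the feature values are modelled here; the label is omitted.
struct FeatureVector {
    std::vector<double> fvals;
};

class FeatureList {
    std::vector<std::unique_ptr<Feature>> features;

  public:
    // Default Feature choice (an arbitrary pair for this sketch).
    FeatureList() {
        add_feature(std::make_unique<LengthFeature>());
        add_feature(std::make_unique<SpaceCountFeature>());
    }

    void add_feature(std::unique_ptr<Feature> f) {
        features.push_back(std::move(f));
    }

    // Build a FeatureVector for one Document, with one value per
    // Feature, in the order the features were added.
    FeatureVector create_fvec(const Document& doc) const {
        FeatureVector fvec;
        fvec.fvals.reserve(features.size());
        for (const auto& f : features) {
            fvec.fvals.push_back(f->get_value(doc));
        }
        return fvec;
    }
};
```
<br>
Because the order of the values follows the order in which features were added, the same FeatureList setup would need to be used on both the training and the re-ranking side to stay consistent.<br>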
<span class=""><br>
> > * Ranker should really be responsible for doing most of the work<br>
> > currently done by Letor. (Preparing training files, training the<br>
> > ranking algorithm &c.)<br>
><br>
> Preparing the training file is limited to FeatureVector calculation<br>
> only. Is there a specific reason to include it in the ranker?<br>
<br>
</span>It feels to me slightly closer to that than anything<br>
else. Alternatively, it could just live in the Xapian::Letor namespace<br>
as a utility function.<br>
<span class=""><br>
> > We shouldn't need serialisation of FeatureList or Feature, because<br>
> > this stuff doesn't have to persist, just be consistent, which is an<br>
> > issue very similar to getting prefixes right. Either a higher-level<br>
> > application has configuration, or it's in shared code between the<br>
> > indexing/training and querying/reranking parts of the system.<br>
><br>
> I understand this now. It's the user's job to make it<br>
> consistent. Thanks for clarifying this.<br>
<br>
</span>Note that if an application or library wants to serialise this<br>
configuration into the Xapian database, it can use Database metadata<br>
so it'll be carried around with the rest of the db. (Of course,<br>
chances are the trained data file for the SVM or whatever won't be<br>
stored in the same way, unless you also stuff them into metadata. The<br>
wisdom of doing this I don't know; Olly may have an opinion.)<br>
<div class="HOEnZb"><div class="h5"><br>
J<br>
<br>
--<br>
James Aylett, occasional trouble-maker<br>
<a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>
<br>
</div></div></blockquote></div><br></div></div>