[Xapian-devel] Feature Selection algorithm

Mon Mar 17 03:41:19 GMT 2014

On Mon, Mar 17, 2014 at 01:51:13AM +0530, Mayank Chaudhary wrote:
> In this research
> paper<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf> on a
> feature selection algorithm for ranking, the importance score of a feature
> is described as:
> 
> "We first assign an importance score to each feature. Specifically, we
> propose using an evaluation measure like MAP and NDCG (the definitions of
> them will be given in Section 3) or a loss function (e.g. pair-wise ranking
> errors [10][13]) to compute the importance score. In the former, we first *rank
> instances(1)* using the feature, evaluate the performance in terms of the
> measure, and then take the evaluation result as the importance score. In
> the latter, we also rank instances using the feature, and then view a score
> inversely proportional to the corresponding loss as the importance score.
> Note that for some features larger values correspond to higher ranks while
> for other features smaller values correspond to higher ranks, when
> calculating MAP, NDCG or the loss of ranking models, *we actually sort the
> instances for two times (in the normal order and in the inverse order),
> and take the larger score as the importance score of the feature.(2)*"
> 
> 1. Is it OK if we rank them with SVMRanker? SVMRanker is a linear-kernel
> SVM, so how did you tune the parameter C (penalty for the error term)? Did
> you use grid search for C?
> 
> 2. I couldn't understand what they mean by these lines in bold. Could you
> please explain them to me?

I think the first one is saying that they use the feature alone to rank
results (e.g. order results purely by BM25 for the BM25 feature), and
see how well that does in terms of MAP and NDCG.

But some features are better when larger (e.g. BM25 score) and some are
better when smaller (I don't have a great example to hand, but perhaps
edit distance between query and document in some systems), and nothing
specifies which is the case for a given feature. So they measure MAP and
NDCG for the documents ranked both ways and take the better score as the
feature's importance.
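
To make that concrete, here's a rough Python sketch of how I read it, using
MAP as the measure. The data layout (a list of queries, each a list of
(features, label) pairs with binary labels) and the function names are my
own invention for illustration, not anything from the paper or from the
letor code, and it assumes at least one query:

```python
def average_precision(relevance):
    """AP for a ranked list of binary relevance labels (1 = relevant)."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def feature_importance(queries, feature):
    """MAP when ranking by one feature alone, in whichever order does better.

    `queries` is a list of queries; each query is a list of
    (features, label) pairs, where `features` maps feature name to value.
    This layout is an assumption made for the sketch.
    """
    scores = []
    for descending in (True, False):  # normal order, then inverse order
        aps = []
        for docs in queries:
            ranked = sorted(docs, key=lambda d: d[0][feature],
                            reverse=descending)
            aps.append(average_precision([label for _, label in ranked]))
        scores.append(sum(aps) / len(aps))
    return max(scores)
```

So a "bigger is better" feature and a "smaller is better" feature which
separate relevant from non-relevant documents equally well end up with the
same importance score.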

It would probably be worthwhile providing hints for features where we
are sure which way is better (there's really no point reverse ranking
with BM25 and measuring MAP and NDCG), but some features might work in
opposite directions in different situations - e.g. document length:
longer documents might tend to be more relevant in one system, but in
another shorter documents might tend to be more relevant.
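
A hint mechanism could be as simple as the sketch below, here using NDCG.
The hint table, the feature names and the function names are hypothetical,
and the gain/discount form of DCG is the common (2^rel - 1) / log2(rank + 1)
one, which may not be exactly what the paper uses:

```python
import math

# Hypothetical hints: True means larger values should rank higher.
# Features not listed here get tried in both directions.
DIRECTION_HINTS = {"bm25": True}


def dcg(labels):
    """DCG of graded relevance labels given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels))


def ndcg(labels):
    """DCG normalised by the DCG of the ideal ordering."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0


def ndcg_importance(values, labels, feature):
    """NDCG for one query when ranking by a single feature's values."""
    hint = DIRECTION_HINTS.get(feature)
    directions = (True, False) if hint is None else (hint,)
    best = 0.0
    for descending in directions:
        order = sorted(range(len(values)), key=lambda i: values[i],
                       reverse=descending)
        best = max(best, ndcg([labels[i] for i in order]))
    return best
```

With no hint for document length, a collection where shorter documents are
more relevant still gets a sensible score for that feature, while BM25 is
only ever evaluated in the one direction that makes sense.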

> *PS*: I've sent a proposal for Letor. It'd be great if you could review it
> and tell me if any detail is missing or I've missed something, so that I
> can improve it.

I've made some comments on it (in case email notifications still aren't
working).

Cheers,
    Olly