[Xapian-devel] [GSoC2012] Learning to Rank: few thoughts/issues

Wed Apr 4 09:38:41 BST 2012

Hi Rishabh,

> To sum up, in addition to the a subset of 134 of MSR's LETOR features,
> we'll use the unsupervised deep learning features to train our model based
> on ListMLE algorithm. As each of this would be implemented in a modular
> fashion, if we get bad results(which shouldn't happen), we'll eliminate
> that module(deep learning one) and keep the remaining; if we get better
> results(which we should), we'll have a research contribution to make to the
> community because Deep Learning techniques haven't been applied to ranking
> systems yet.
>

Okay, this seems feasible that, deep learning module is optional. That is
sane to see the ListMLE as the least outcome of the project. Looking at the
application deadline, I would like to see all the details in a proper
application format and we can discuss on top of that.

Best,
Parth.

>
> Regards,
> Rishabh.
>
>
> On Wed, Apr 4, 2012 at 1:19 PM, Parth Gupta <parthg.88 at gmail.com> wrote:
>
>> Hi Rishabh,
>>
>> If I have understood correctly ( thought I am not sure), do you want to
>> extract features using some of the deep learning techniques in the
>> semi-supervised setting of Letor. Can you give me some idea about what
>> these features are?
>>
>> Are you planning to implement ListMLE in addition to this?
>>
>> Parth.
>>
>>
>> On Mon, Apr 2, 2012 at 10:03 PM, Rishabh Mehrotra <erishabh at gmail.com>wrote:
>>
>>> Thanks Parth for your inputs. I had gone through the 134 or so feature
>>> list at MSR's LETOR site for the ranking algorithms. The traditional IR
>>> features I was talking about in my previous mail were referring to those
>>> ones.
>>>
>>> Thanks for the heads-up on the unsupervised features part. I will be
>>> more inclined on using one of the recently famous *deep learning
>>> architectures* for unsupervised feature extraction simply because they
>>> have been shown to outperform hand-crafted feature based state-of-the-art
>>> algorithms. PFA a snapshot<Deep Learning.png> from one of the tutorials by
>>> Andrew Ng highlighting the same fact.
>>>
>>> I would like to discuss about the probable methodology regarding the
>>> same.
>>>
>>> *Methodology:*
>>> In order to do unsupervised feature extraction from the documents, I'll
>>> use Denoising Stacked Autoencoders. Its a 2 stage process:
>>>
>>>    - Unsupervised pre-training
>>>    - Supervised fine tuning
>>>
>>> *Unsupervised pre-training:*
>>>
>>> The denoising autoencoders[link<http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity>]
>>> can be stacked to form a deep network by feeding the latent representation
>>> (output code) of the denoising auto-encoder found on the layer below as
>>> input to the current layer. The *unsupervised pre-training* of such an
>>> architecture is done one layer at a time. Each layer is trained as a
>>> denoising auto-encoder by minimizing the reconstruction of its input (which
>>> is the output code of the previous layer). Once the first *n* layers
>>> are trained, we can train the *n+1-th* layer because we can now compute
>>> the code or latent representation from the layer below.
>>> The kind of representation learnt by this step is a rich representation
>>> which has shown to be better than hand-crafted features/representations.
>>>
>>>
>>> *Supervised fine tuning:*
>>>
>>> Once all layers are pre-trained, the network goes through a second stage
>>> of training called fine-tuning. Generally we consider supervised
>>> fine-tuning where we want to minimize prediction error on a supervised
>>> task.
>>> *For our problem statement(ranking):* we would use the features
>>> generated in the unsupervised pre-training stage and add the previously
>>> used IR features(a subset of the 136 LETOR features) and feed this to
>>> ListMLE or other Learning to Rank algorithm.
>>>
>>> To sum up, we would perform unsupervised pre-training to get features
>>> for all the documents(unlabeled as well as labeled) and then proceed to do
>>> supervised fine tuning only on the labeled data/documents. This way we
>>> would use both labelled and unlabeled data to learn features to represent
>>> documents. I have worked a bit on a related project, so the integration of
>>> such a combination (Unsupervised feature extraction+ListMLE) into xapian
>>> shouldn't be a tough asking(hopefully) as a GSoC project.
>>>
>>>
>>> Your inputs are welcome. Thanks for your time.
>>>
>>> -
>>> Rishabh.
>>>
>>> PS: A lot of hot-research is going on in the Deep Learning field. Here's
>>> a link <http://deeplearningworkshopnips2011.wordpress.com/> to a
>>> NIPS'11(Tier-1 in ML) workshop on the same for everyone's reference.
>>>
>>>
>>> On Mon, Apr 2, 2012 at 3:52 PM, Parth Gupta <parthg.88 at gmail.com> wrote:
>>>
>>>> Hello Rishabh,
>>>>
>>>> Good to hear from you. Its never late to jump-in for GSoC.
>>>>
>>>>  *Doubt1:*
>>>>>
>>>>> *Feature Extraction/Selection:*
>>>>> The various datasets listed on MSR's LETOR have a limited set of
>>>>> features. Current implementation in xapian's LETOR has 5
>>>>> features[tf,idf,doc_len,coll_tf,coll_len]. While algorithms for learning
>>>>> ranking models have been intensively studied, this is not the case for
>>>>> feature selection, despite of its importance. In a paper presented at
>>>>> SIGIR'07 [Tier1 in IR domain], the authors have highlighted the
>>>>> effectiveness of feature selection methods for ranking tasks.[link<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf>]
>>>>> I believe that apart from the traditional/cliched IR features, we should
>>>>> * incorporate new features* to improve the performance of the LETOR
>>>>> module.
>>>>>
>>>>>
>>>> There is no point denying the fact that there is a need for more
>>>> features. If you have noticed on the GSoC idea page of Letor, it says "The
>>>> project can also include some work on the features, like adding support for
>>>> more features, selecting a subset of features, etc." Now the point comes,
>>>> which features you want to incorporate. The Letor datasets are growing
>>>> enormously in terms of number of features [Letor MSR 46 ->136 , Yahoo
>>>> Dataset 700]. It would make sense to incorporate those features which can
>>>> be tracked and suits the environment. More over majority of the features
>>>> dwell around the IR measures like bm25, TF, IDF, LM and different
>>>> combination of them for different part of the document. Some of the other
>>>> features of Letor Datasets are number of outgoing links, number of incoming
>>>> links, page rank, number of children [1,2]. These features are valid and
>>>> available only in the linked data and moreover, straight forward to
>>>> compute. Yahoo dataset does not even declare the features because of the
>>>> proprietary issues. But I think it also includes some features like the age
>>>> of the page, number of clicks on it, total time spent, and so on.
>>>>
>>>> [1]
>>>> http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/Features_in_LETOR4.pdf
>>>> [2] http://research.microsoft.com/en-us/projects/mslr/feature.aspx
>>>>
>>>>>
>>>>> *Using unlabeled data:*
>>>>> Over the last 3-4 years a lot of papers have identified the importance
>>>>> of using unlabeled data to assist the task at hand by using it during
>>>>> feature extraction stage. Andrew Ng proposed a Self-Taught learning
>>>>> framework[ICML'07 paper<http://ai.stanford.edu/%7Ehllee/icml07-selftaughtlearning.pdf>]
>>>>> wherein they make use of unlabeled data to improve performance. A very
>>>>> recent paper at ICML'11<http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf>used the advantage of feature learning using unlabeled data and beat the
>>>>> state-of-the-art in sentiment classification.
>>>>>
>>>>> Combining the above two points, I suggest an approach which uses
>>>>> features learnt from data in an unsupervised fashion "*in addition to*"
>>>>> the commonly used features.
>>>>> *Please note:* all this is in addition to the traditional features
>>>>> and finally we would be using *listwise/pairwise approaches*[ListMLE,
>>>>> et cetera] to train our models on the new set of features. Please let me
>>>>> know if this sounds good.
>>>>>
>>>>>
>>>> This phenomenon, Semi-supervised ranking, is indeed interesting. If you
>>>> want to incorporate it, feel free to discuss the plan.
>>>>
>>>>
>>>>> *Doubt2:*
>>>>>
>>>>> *Rank Aggregation:*
>>>>> Now that Xappian will have >1 Learning to rank algorithms, we should
>>>>> look into some kind of rank aggregation as well: combining outputs by
>>>>> various algorithms to get a final rank ordering for results. I went though
>>>>> a ECML'07 paper on unsupervised method for the same[link<http://l2r.cs.uiuc.edu/%7Edanr/Papers/KlementievRoSm07.pdf>].
>>>>> I haven't yet completely understood their approach but will do so by the
>>>>> end of day.
>>>>>
>>>>>
>>>> Rank Aggregation, is another LTR approach with a set of ranked lists at
>>>> hand for the query. At the moment Xapian can have 2 ranked list, BM25 and
>>>> SVM based LTR scheme. I think these techniques will produce better results
>>>> with the input of more number of ranked list than Xapian can offer at the
>>>> moment. But it would be interesting to explore after some more ranking
>>>> schemes incorporation.
>>>>
>>>>
>>>>>
>>>>> *Modularity:*
>>>>> Developing such modules in a modular fashion such that its not
>>>>> necessary to use all of them all the times, would be good. Whenever the
>>>>> user feels that in addition to basic features, he/she could use additional
>>>>> features, the feature extraction module could be plugged in. Same for rank
>>>>> aggregation.
>>>>>
>>>>
>>>> Agreed and that in fact, that will be the goal.
>>>>
>>>> Best,
>>>> Parth.
>>>>
>>>>
>>>>> *Relevant Background:*
>>>>> I have worked on few research oriented projects in Machine Learning,
>>>>> but most of them involved coding in Matlab/Java. More details about me: [
>>>>> link <http://www.rishabhmehrotra.com/index.htm>].
>>>>> I have been working on a project on Topic Modeling(using Latent
>>>>> Dirichlet Allocation) for Tweets. Link<http://code.google.com/p/tweettrends/>of the code on Google code. Also, I am involved in a collage project on
>>>>> building *focused crawler *& extending it to something like NELL<http://rtw.ml.cmu.edu/rtw/><far-fetched
>>>>> dream as of now :) >.[Google code link<http://code.google.com/p/bits-crawler/source/browse/>
>>>>> ]
>>>>>
>>>>> Please let me know how you feel about the above mentioned points
>>>>> [and/or if I am way off the track].
>>>>>
>>>>> Best,
>>>>> Rishabh.
>>>>>
>>>>> _______________________________________________
>>>>> Xapian-devel mailing list
>>>>> Xapian-devel at lists.xapian.org
>>>>> http://lists.xapian.org/mailman/listinfo/xapian-devel
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Rishabh.
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120404/39c33467/attachment-0001.htm>