[Xapian-devel] [GSoC2012] Learning to Rank: few thoughts/issues

Wed Apr 4 09:49:10 BST 2012

Thanks Parth for the feedback. Really appreciate your inputs. :)
I will prepare my proposal and wait for your(& others') inputs.

Thanks,
Rishabh.

On Wed, Apr 4, 2012 at 2:08 PM, Parth Gupta <parthg.88 at gmail.com> wrote:

> Hi Rishabh,
>
>
>> To sum up, in addition to the a subset of 134 of MSR's LETOR features,
>> we'll use the unsupervised deep learning features to train our model based
>> on ListMLE algorithm. As each of this would be implemented in a modular
>> fashion, if we get bad results(which shouldn't happen), we'll eliminate
>> that module(deep learning one) and keep the remaining; if we get better
>> results(which we should), we'll have a research contribution to make to the
>> community because Deep Learning techniques haven't been applied to ranking
>> systems yet.
>>
>
> Okay, this seems feasible that, deep learning module is optional. That is
> sane to see the ListMLE as the least outcome of the project. Looking at the
> application deadline, I would like to see all the details in a proper
> application format and we can discuss on top of that.
>
> Best,
> Parth.
>
>>
>> Regards,
>> Rishabh.
>>
>>
>> On Wed, Apr 4, 2012 at 1:19 PM, Parth Gupta <parthg.88 at gmail.com> wrote:
>>
>>> Hi Rishabh,
>>>
>>> If I have understood correctly ( thought I am not sure), do you want to
>>> extract features using some of the deep learning techniques in the
>>> semi-supervised setting of Letor. Can you give me some idea about what
>>> these features are?
>>>
>>> Are you planning to implement ListMLE in addition to this?
>>>
>>> Parth.
>>>
>>>
>>> On Mon, Apr 2, 2012 at 10:03 PM, Rishabh Mehrotra <erishabh at gmail.com>wrote:
>>>
>>>> Thanks Parth for your inputs. I had gone through the 134 or so feature
>>>> list at MSR's LETOR site for the ranking algorithms. The traditional IR
>>>> features I was talking about in my previous mail were referring to those
>>>> ones.
>>>>
>>>> Thanks for the heads-up on the unsupervised features part. I will be
>>>> more inclined on using one of the recently famous *deep learning
>>>> architectures* for unsupervised feature extraction simply because they
>>>> have been shown to outperform hand-crafted feature based state-of-the-art
>>>> algorithms. PFA a snapshot<Deep Learning.png> from one of the tutorials by
>>>> Andrew Ng highlighting the same fact.
>>>>
>>>> I would like to discuss about the probable methodology regarding the
>>>> same.
>>>>
>>>> *Methodology:*
>>>> In order to do unsupervised feature extraction from the documents, I'll
>>>> use Denoising Stacked Autoencoders. Its a 2 stage process:
>>>>
>>>>    - Unsupervised pre-training
>>>>    - Supervised fine tuning
>>>>
>>>> *Unsupervised pre-training:*
>>>>
>>>> The denoising autoencoders[link<http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity>]
>>>> can be stacked to form a deep network by feeding the latent representation
>>>> (output code) of the denoising auto-encoder found on the layer below as
>>>> input to the current layer. The *unsupervised pre-training* of such an
>>>> architecture is done one layer at a time. Each layer is trained as a
>>>> denoising auto-encoder by minimizing the reconstruction of its input (which
>>>> is the output code of the previous layer). Once the first *n* layers
>>>> are trained, we can train the *n+1-th* layer because we can now
>>>> compute the code or latent representation from the layer below.
>>>> The kind of representation learnt by this step is a rich representation
>>>> which has shown to be better than hand-crafted features/representations.
>>>>
>>>>
>>>> *Supervised fine tuning:*
>>>>
>>>> Once all layers are pre-trained, the network goes through a second
>>>> stage of training called fine-tuning. Generally we consider supervised
>>>> fine-tuning where we want to minimize prediction error on a supervised
>>>> task.
>>>> *For our problem statement(ranking):* we would use the features
>>>> generated in the unsupervised pre-training stage and add the previously
>>>> used IR features(a subset of the 136 LETOR features) and feed this to
>>>> ListMLE or other Learning to Rank algorithm.
>>>>
>>>> To sum up, we would perform unsupervised pre-training to get features
>>>> for all the documents(unlabeled as well as labeled) and then proceed to do
>>>> supervised fine tuning only on the labeled data/documents. This way we
>>>> would use both labelled and unlabeled data to learn features to represent
>>>> documents. I have worked a bit on a related project, so the integration of
>>>> such a combination (Unsupervised feature extraction+ListMLE) into xapian
>>>> shouldn't be a tough asking(hopefully) as a GSoC project.
>>>>
>>>>
>>>> Your inputs are welcome. Thanks for your time.
>>>>
>>>> -
>>>> Rishabh.
>>>>
>>>> PS: A lot of hot-research is going on in the Deep Learning field.
>>>> Here's a link <http://deeplearningworkshopnips2011.wordpress.com/> to
>>>> a NIPS'11(Tier-1 in ML) workshop on the same for everyone's reference.
>>>>
>>>>
>>>> On Mon, Apr 2, 2012 at 3:52 PM, Parth Gupta <parthg.88 at gmail.com>wrote:
>>>>
>>>>> Hello Rishabh,
>>>>>
>>>>> Good to hear from you. Its never late to jump-in for GSoC.
>>>>>
>>>>>  *Doubt1:*
>>>>>>
>>>>>> *Feature Extraction/Selection:*
>>>>>> The various datasets listed on MSR's LETOR have a limited set of
>>>>>> features. Current implementation in xapian's LETOR has 5
>>>>>> features[tf,idf,doc_len,coll_tf,coll_len]. While algorithms for learning
>>>>>> ranking models have been intensively studied, this is not the case for
>>>>>> feature selection, despite of its importance. In a paper presented at
>>>>>> SIGIR'07 [Tier1 in IR domain], the authors have highlighted the
>>>>>> effectiveness of feature selection methods for ranking tasks.[link<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf>]
>>>>>> I believe that apart from the traditional/cliched IR features, we should
>>>>>> * incorporate new features* to improve the performance of the LETOR
>>>>>> module.
>>>>>>
>>>>>>
>>>>> There is no point denying the fact that there is a need for more
>>>>> features. If you have noticed on the GSoC idea page of Letor, it says "The
>>>>> project can also include some work on the features, like adding support for
>>>>> more features, selecting a subset of features, etc." Now the point comes,
>>>>> which features you want to incorporate. The Letor datasets are growing
>>>>> enormously in terms of number of features [Letor MSR 46 ->136 , Yahoo
>>>>> Dataset 700]. It would make sense to incorporate those features which can
>>>>> be tracked and suits the environment. More over majority of the features
>>>>> dwell around the IR measures like bm25, TF, IDF, LM and different
>>>>> combination of them for different part of the document. Some of the other
>>>>> features of Letor Datasets are number of outgoing links, number of incoming
>>>>> links, page rank, number of children [1,2]. These features are valid and
>>>>> available only in the linked data and moreover, straight forward to
>>>>> compute. Yahoo dataset does not even declare the features because of the
>>>>> proprietary issues. But I think it also includes some features like the age
>>>>> of the page, number of clicks on it, total time spent, and so on.
>>>>>
>>>>> [1]
>>>>> http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/Features_in_LETOR4.pdf
>>>>> [2] http://research.microsoft.com/en-us/projects/mslr/feature.aspx
>>>>>
>>>>>>
>>>>>> *Using unlabeled data:*
>>>>>> Over the last 3-4 years a lot of papers have identified the
>>>>>> importance of using unlabeled data to assist the task at hand by using it
>>>>>> during feature extraction stage. Andrew Ng proposed a Self-Taught learning
>>>>>> framework[ICML'07 paper<http://ai.stanford.edu/%7Ehllee/icml07-selftaughtlearning.pdf>]
>>>>>> wherein they make use of unlabeled data to improve performance. A very
>>>>>> recent paper at ICML'11<http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf>used the advantage of feature learning using unlabeled data and beat the
>>>>>> state-of-the-art in sentiment classification.
>>>>>>
>>>>>> Combining the above two points, I suggest an approach which uses
>>>>>> features learnt from data in an unsupervised fashion "*in addition to
>>>>>> *" the commonly used features.
>>>>>> *Please note:* all this is in addition to the traditional features
>>>>>> and finally we would be using *listwise/pairwise approaches*[ListMLE,
>>>>>> et cetera] to train our models on the new set of features. Please let me
>>>>>> know if this sounds good.
>>>>>>
>>>>>>
>>>>> This phenomenon, Semi-supervised ranking, is indeed interesting. If
>>>>> you want to incorporate it, feel free to discuss the plan.
>>>>>
>>>>>
>>>>>> *Doubt2:*
>>>>>>
>>>>>> *Rank Aggregation:*
>>>>>> Now that Xappian will have >1 Learning to rank algorithms, we should
>>>>>> look into some kind of rank aggregation as well: combining outputs by
>>>>>> various algorithms to get a final rank ordering for results. I went though
>>>>>> a ECML'07 paper on unsupervised method for the same[link<http://l2r.cs.uiuc.edu/%7Edanr/Papers/KlementievRoSm07.pdf>].
>>>>>> I haven't yet completely understood their approach but will do so by the
>>>>>> end of day.
>>>>>>
>>>>>>
>>>>> Rank Aggregation, is another LTR approach with a set of ranked lists
>>>>> at hand for the query. At the moment Xapian can have 2 ranked list, BM25
>>>>> and SVM based LTR scheme. I think these techniques will produce better
>>>>> results with the input of more number of ranked list than Xapian can offer
>>>>> at the moment. But it would be interesting to explore after some more
>>>>> ranking schemes incorporation.
>>>>>
>>>>>
>>>>>>
>>>>>> *Modularity:*
>>>>>> Developing such modules in a modular fashion such that its not
>>>>>> necessary to use all of them all the times, would be good. Whenever the
>>>>>> user feels that in addition to basic features, he/she could use additional
>>>>>> features, the feature extraction module could be plugged in. Same for rank
>>>>>> aggregation.
>>>>>>
>>>>>
>>>>> Agreed and that in fact, that will be the goal.
>>>>>
>>>>> Best,
>>>>> Parth.
>>>>>
>>>>>
>>>>>> *Relevant Background:*
>>>>>> I have worked on few research oriented projects in Machine Learning,
>>>>>> but most of them involved coding in Matlab/Java. More details about me: [
>>>>>> link <http://www.rishabhmehrotra.com/index.htm>].
>>>>>> I have been working on a project on Topic Modeling(using Latent
>>>>>> Dirichlet Allocation) for Tweets. Link<http://code.google.com/p/tweettrends/>of the code on Google code. Also, I am involved in a collage project on
>>>>>> building *focused crawler *& extending it to something like NELL<http://rtw.ml.cmu.edu/rtw/><far-fetched
>>>>>> dream as of now :) >.[Google code link<http://code.google.com/p/bits-crawler/source/browse/>
>>>>>> ]
>>>>>>
>>>>>> Please let me know how you feel about the above mentioned points
>>>>>> [and/or if I am way off the track].
>>>>>>
>>>>>> Best,
>>>>>> Rishabh.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Xapian-devel mailing list
>>>>>> Xapian-devel at lists.xapian.org
>>>>>> http://lists.xapian.org/mailman/listinfo/xapian-devel
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Rishabh.
>>>>
>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120404/82d64e27/attachment.htm>