[Xapian-devel] The workflow of letor module

Parth Gupta pargup8 at gmail.com
Tue Jun 10 08:43:20 BST 2014


Hi Hanxiao,

It's better to discuss this on a public list or IRC, because that way we
can also involve more people in the discussion.

I would also count developers who use Xapian to support their own projects
as users, and they would definitely want to tune parameters if they have a
labelled training/development file. The metric module will be an integral
part of the training module. It may not end up in the public API, but that
can be decided later.
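
As a rough illustration of that tuning step, here is a minimal C++ sketch of
the parameter sweep described in the quoted 2014-06-08 message below. It is
not xapian-letor code: sweep_parameter, SweepResult, kParamC and the
train_and_score callback are hypothetical names. The callback is assumed to
train a ranker with the given value and return a "higher is better" metric
computed on the development set.

```cpp
#include <functional>
#include <limits>
#include <stdexcept>
#include <vector>

// Result of a sweep: the chosen parameter value and its validation score.
struct SweepResult {
    double best_param;
    double best_score;
};

// Hypothetical helper (not part of xapian-letor): tries every candidate
// value, asks train_and_score to train with it and score the result on the
// development set, and keeps the best-scoring value. Ties go to the
// earliest candidate.
SweepResult sweep_parameter(
    const std::vector<double>& candidates,
    const std::function<double(double)>& train_and_score)
{
    if (candidates.empty())
        throw std::invalid_argument("no candidate values to sweep");

    SweepResult result{candidates.front(),
                       -std::numeric_limits<double>::infinity()};
    for (double c : candidates) {
        double score = train_and_score(c);
        if (score > result.best_score) {
            result.best_param = c;
            result.best_score = score;
        }
    }
    return result;
}

// The candidate values for RankSVM's C quoted in this thread.
const std::vector<double> kParamC = {0.0001, 0.0005, 0.001, 0.002, 0.005,
                                     0.01,   0.05,   0.1,   0.2,   0.5,
                                     1,      2,      5,     10};
```

A caller would pass kParamC together with a callback that wraps the real
training and metric code. Keeping the metric behind a callback means it can
stay internal to the training module, whether or not it is exposed in the
public API.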

Cheers,
Parth.



On Mon, Jun 9, 2014 at 3:00 PM, Hanxiao Sun <sunhanxiaoisme at gmail.com>
wrote:

> Hi~ Parth,
>
> Do you mean we just set default parameters for the user, so that the letor
> module is transparent to the user?
>
> In that case the metric module is also of no use to the user. We would only
> use it in the development phase.
>
> One more question: should I send my replies to the Xapian community list,
> or just to you?
>
>
> Thanks,
> Hanxiao.
>
>
> 2014-06-08 14:41 GMT+08:00 Parth Gupta <pargup8 at gmail.com>:
>
>> Hi Hanxiao,
>>
>> It's quite easy to handle. Most of the rankers have a couple of parameters
>> to tune. During the train method we supply the possible range of each
>> parameter for a sweep, and we select the values which perform best on the
>> development set (aka the validation set). For example, parameter C of
>> RankSVM has the range
>>
>> double[] paramC = {0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.05,
>>                    0.1, 0.2, 0.5, 1, 2, 5, 10};
>>
>>
>> Cheers,
>> Parth.
>>
>>
>> On Thu, Jun 5, 2014 at 9:56 AM, Hanxiao Sun <sunhanxiaoisme at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am Hanxiao Sun, the student working on the letor module with Jiarong
>>> in this year's GSoC.
>>>
>>> I have some ideas about xapian-letor that I would like to discuss with
>>> the community.
>>>
>>> In the present code, we prepare a training file before we use the letor
>>> module. When we have a query, we train the SVM model using the default
>>> parameters. We then use the MSet returned by the query as the test set,
>>> using the trained SVM model to predict the ranking of the MSet. To be
>>> clear, we have no labels (ground truth) in the test set, so we cannot
>>> (and perhaps need not) evaluate the ranking result in the current
>>> workflow.
>>>
>>> For a normal user, the current workflow is OK (although the problem of
>>> how to obtain the training file from such users is still unsolved).
>>> They don't care about the specific parameters of each ranking module;
>>> they just want the best possible ranking result.
>>>
>>> But other users, such as those who want to tune the parameters or add
>>> features to the ranking module, also want to evaluate the ranking result
>>> on the test set. In other words, they will have ground truth in their
>>> test set and need to use the metric module in the test process.
>>>
>>> The difference between these two groups of users is that we call the
>>> metric module to evaluate the ranking result only if the test set has
>>> ground truth. This raises the question of whether we need an independent
>>> script that calls the metric module outside "questletor". If we don't
>>> separate the evaluation process from "questletor", the user has to choose
>>> the mode in which they run "questletor": with ground truth or without.
>>> But if we do separate it, the user will have some trouble when they want
>>> to do k-fold cross validation: they need to split the data themselves and
>>> run "questletor" and the evaluation script k times.
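
The manual split mentioned above could look roughly like the following C++
sketch. It is only an illustration under stated assumptions: Fold and
k_fold_split are hypothetical names, not xapian-letor code, and the split is
assumed to be done per query (so all documents of one query stay in the same
fold), with query i held out in fold i % k.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One fold: indices of the queries used for training and for testing.
struct Fold {
    std::vector<std::size_t> train;
    std::vector<std::size_t> test;
};

// Hypothetical helper: splits query indices 0..n_queries-1 into k folds.
// Query i is held out (test) in fold i % k and used for training in every
// other fold.
std::vector<Fold> k_fold_split(std::size_t n_queries, std::size_t k)
{
    if (k < 2 || k > n_queries)
        throw std::invalid_argument("k must satisfy 2 <= k <= n_queries");

    std::vector<Fold> folds(k);
    for (std::size_t i = 0; i < n_queries; ++i) {
        for (std::size_t f = 0; f < k; ++f) {
            if (i % k == f)
                folds[f].test.push_back(i);
            else
                folds[f].train.push_back(i);
        }
    }
    return folds;
}
```

Each returned fold would then correspond to one training run followed by one
evaluation run, which is the "k times" loop the user would otherwise have to
script by hand.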
>>>
>>> I'm not sure whether I understand this correctly, and it seems to be more
>>> relevant to Jiarong's part. However, I still want to make it clear. Any
>>> comments and suggestions would be appreciated.
>>>
>>> Thanks!
>>> --
>>> 孙晗晓(Hanxiao Sun)
>>> Master Student of Computer Science at Institute of Computing
>>> Technology, Chinese Academy of Sciences (ICT)
>>> Email: sunhanxiaoisme at gmail.com
>>> Mobile: (86)186-0025-6936
>>>
>>> ------------------------------
>>> This email (including any attachments) is confidential and may be
>>> legally privileged. If you received this email in error, please delete it
>>> immediately and do not copy it or use it for any purpose or disclose its
>>> contents to any other person. Thank you.
>>
>
>
> --
> 孙晗晓(Hansel Sun)
> Master Student of Computer Science at Institute of Computing
> Technology, Chinese Academy of Sciences (ICT)
> Email: sunhanxiaoisme at gmail.com
> Mobile: (86)186-0025-6936
>