Your comment on training data from &#39;click-through&#39; certainly makes sense, but I am not sure what sources we could use to produce such data right now; suggestions are welcome. The main idea would be to build the infrastructure for learning and to learn the ranking function from the currently available training data. As and when better data arrives or is generated, we can re-learn the ranking function.<br>

<br>If some of our features are time-dependent, then we have to retrain at some interval anyway, so it should not be a problem to feed new and better training data to the algorithm.<br>

<br>Okay, so I will submit the application with this feedback in mind. If, after reviewing the entire application, something needs more clarification (within the deadline, of course), then please let me know.<br><br>

Regards,<br>Parth.<br><br><br><br><div class="gmail_quote">On Mon, Apr 4, 2011 at 8:16 AM, Olly Betts <span dir="ltr">&lt;<a href="mailto:olly@survex.com">olly@survex.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">On Sun, Apr 03, 2011 at 08:57:36PM +0530, Parth Gupta wrote:<br>

&gt; Click-through measurements are certainly a good measure for automatic<br>

&gt; preparation of training data. But what I have in my mind is if we consider<br>

&gt; relevance as a binary variable, then for the training data many<br>

&gt; relevance judgements are available for the ad-hoc retrieval task from good IR<br>

&gt; conferences like TREC or FIRE, so we can prepare the feature vectors from<br>

&gt; them. It will be a first benchmark for the project guideline. It will be<br>

&gt; reliable too because it is human-judged and comprises both the relevant and<br>

&gt; non-relevant documents, so it is an unbiased sample and good for machine learning.<br>

<br>

</div>That&#39;s OK for developing this.  But it seems likely that training in one<br>

domain won&#39;t transfer reliably to another, so someone developing a<br>

search which uses this will really need their own training data.<br>

<br>

So for a developer wanting to deploy this, being able to automatically<br>

crowd-source their training data by tracking clicks on search results is<br>

much more appealing than having to invest time and/or money in getting<br>

relevance judgements produced specially.  Click data also allows training<br>

to be a more continuous process, which is beneficial for sites where<br>

topics evolve fairly quickly with time (like news sites).<br>

<br>

The click data is almost certainly going to be noisier, which might be<br>

an issue for training, but for a busy site you can easily produce much<br>

more of it than you can with explicit relevance judgements, so perhaps<br>

the noise can be filtered out if it is an issue.<br>

<div class="im"><br>

&gt; Also, I am very new to the formalities of submitting an application for GSoC,<br>

&gt; so if things happen early then I would have enough time to shape the<br>

&gt; application in light of the feedback.<br>

<br>

</div>The formalities are that you need to file an application here before<br>

19:00 UTC on April 8th:<br>

<br>

<a href="http://socghop.appspot.com/gsoc/org/google/gsoc2011/xapian" target="_blank">http://socghop.appspot.com/gsoc/org/google/gsoc2011/xapian</a><br>

<br>

But it&#39;s a good idea to get your application in sooner than that to<br>

give us a chance to review it and make comments.  There&#39;s also likely to<br>

be a surge in proposals as the deadline nears.  You&#39;re able to make<br>

changes up until the deadline.<br>

<br>

Cheers,<br>

<font color="#888888">    Olly<br>

</font></blockquote></div><br>