<div dir="ltr"><font color="#000000">2014-03-11 8:47 GMT+08:00 Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span><span style="font-family:arial,sans-serif;font-size:14px"> wrote</span>:</font><div>

<font color="#000000"><br></font><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"></blockquote><div class="gmail_extra">

<span style="font-family:arial,sans-serif;font-size:14px">> Most applications of Xapian are interactive, so to actually be</span><br style="font-family:arial,sans-serif;font-size:14px"><span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">practically useful clustering needs to complete in a reasonable amount</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">of time (a fraction of a second ideally).  I think that needs to be a key</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">aim of the project.</span><br style="font-family:arial,sans-serif;font-size:14px"><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">If by "find new approaches" you mean a different approach to that used by</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">the existing clustering branch, then sure.  If you're talking about</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">doing original research, I'd be a little cautious about that, as</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">clustering is a relatively mature field, and I'm a bit dubious a student</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">could develop and implement a new approach in the GSoC timescale.</span><br style="font-family:arial,sans-serif;font-size:14px">

<br><span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">But if that aim is addressed, exactly what else the project consists of</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">is largely up to you.</span><br style="font-family:arial,sans-serif;font-size:14px"><br><br><br>

Thank you for your patient explanation of the project. My understanding of</div><div class="gmail_extra">the "Clustering of Search Results" project is that the main focus is the processing </div><div class="gmail_extra">

speed of the existing code.</div><div class="gmail_extra"><br></div><div class="gmail_extra">By "find new approaches" I mean trying other known clustering algorithms. My </div><div class="gmail_extra">concern is whether the poor performance is caused by an unsuitable algorithm. I am still reading</div>

<div class="gmail_extra">the code in the existing clustering branch and have not finished yet. I should be</div><div class="gmail_extra">able to say more about the existing code in my GSoC application. For now, I would </div>

<div class="gmail_extra">rather not comment until I fully understand the existing code.</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br><br><br style="font-family:arial,sans-serif;font-size:14px"><span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">That's a good question - I'm not sure how clustering effectiveness is</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">typically measured.  But if we're implementing known approaches,</span><br style="font-family:arial,sans-serif;font-size:14px">

<span style="font-family:arial,sans-serif;font-size:14px">> </span><span style="font-family:arial,sans-serif;font-size:14px">a formal evaluation of effectiveness is probably less necessary.</span><br style="font-family:arial,sans-serif;font-size:14px">

<br><br><br>My idea for measuring clustering effectiveness is that, when we try other known </div><div class="gmail_extra">clustering algorithms, we can use the output of the existing clustering code as a baseline. If the new results</div>

<div class="gmail_extra">differ from that baseline by an acceptable amount and the new algorithm is faster,</div><div class="gmail_extra">we may have found a better approach. I will give more details about this in my GSoC application.<br>
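<br>One possible way to score such a baseline comparison is the Rand index: the fraction of document pairs on which two clusterings agree (both put the pair together, or both keep it apart). The following is only a minimal sketch, assuming each clustering is available as a mapping from document id to cluster label. The function name rand_index and the sample labels are hypothetical and are not taken from Xapian or the clustering branch.<br>
<pre>
```python
from itertools import combinations


def rand_index(baseline, candidate):
    """Fraction of document pairs on which two clusterings agree.

    Each argument maps a document id to a cluster label. A pair counts as
    an agreement when both clusterings put the two documents together, or
    both keep them apart. Only documents present in both are compared.
    """
    docs = sorted(set(baseline).intersection(candidate))
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 1.0  # fewer than two shared documents: nothing to disagree on
    agree = sum(
        (baseline[a] == baseline[b]) == (candidate[a] == candidate[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical labels for five search results (made up for illustration).
old_clusters = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
new_clusters = {1: "x", 2: "x", 3: "y", 4: "z", 5: "z"}
score = rand_index(old_clusters, new_clusters)
```
</pre>
A score of 1.0 means the two clusterings group the documents identically, however the clusters are labelled. What counts as an acceptable score would still have to be decided for the project.<br>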

<br><br><br>Thanks</div><div class="gmail_extra">Liu Chi<br><br><br><div class="gmail_quote">2014-03-11 8:47 GMT+08:00 Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span>:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="">On Mon, Mar 10, 2014 at 08:50:14PM +0800, Chi Liu wrote:<br>


> The topic of "Clustering of Search Results" looks interesting and I think<br>

> it suits me. I have been involved in a project that aims to cluster<br>

> tweets based on text similarity and user profiles. I noticed that<br>

> "Clustering of Search Results" mentions disappointing performance. Is<br>

> this project only concerned with improving the performance of<br>

> the old code, or also with trying to find new approaches?<br>

<br>

</div>Most applications of Xapian are interactive, so to actually be<br>

practically useful clustering needs to complete in a reasonable amount<br>

of time (a fraction of a second ideally).  I think that needs to be a key<br>

aim of the project.<br>

<br>

But if that aim is addressed, exactly what else the project consists of<br>

is largely up to you.<br>

<br>

If by "find new approaches" you mean a different approach to that used by<br>

the existing clustering branch, then sure.  If you're talking about<br>

doing original research, I'd be a little cautious about that, as<br>

clustering is a relatively mature field, and I'm a bit dubious a student<br>

could develop and implement a new approach in the GSoC timescale.<br>

<div class=""><br>

> Besides clustering speed, how should clustering effectiveness be evaluated?<br>

<br>

</div>That's a good question - I'm not sure how clustering effectiveness is<br>

typically measured.  But if we're implementing known approaches,<br>

a formal evaluation of effectiveness is probably less necessary.<br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div>Chi Liu</div><div>+86-15210624786</div><div>Undergraduate Student</div><div>Team of Search Engine and Web Mining</div><div>School of Electronic Engineering  and Computer Science</div>

<div>Peking University, Beijing, 100871, P.R.China</div>

</div></div></div>