<div dir="ltr">Hello all,<div><br></div><div>I have submitted the final proposal for this project.</div><div>Here's the link:</div><div><br></div><div><a href="https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/sam_1993/5707702298738688?validated=True#" target="_blank">https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/sam_1993/5707702298738688?validated=True#</a><br>


</div><div><br></div><div>I would like to hear what's missing in this, so that I can do the necessary changes. It would really be a great help.</div><div><br></div><div>Thanks,</div><div>Saksham</div></div><div class="gmail_extra">

<br><br><div class="gmail_quote">On Tue, Mar 11, 2014 at 6:31 PM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="">On Mon, Mar 10, 2014 at 07:37:30PM +0530, Saksham Maheshwari wrote:<br>

> On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <<a href="mailto:olly@survex.com">olly@survex.com</a>> wrote:<br>

><br>

> > Exactly what approach the project takes isn't nailed down - it just<br>

> > seemed something which would be interesting for a student to work on,<br>

> > and would be useful to Xapian users.<br>

> ><br>

> > My understanding of the current clustering branch (which may not be<br>

> > completely accurate) is that it clusters based on a pair-wise measure<br>

> > of document similarity, and that the user can specify which terms from<br>

> > the documents are used.  I think you'd consider more than just the words<br>

> > in the query - in a typical case, the query is short and the top N<br>

> > documents will match all the words in it.<br>

><br>

> So, what you are saying is that we need to<br>

>  1. Assign similar objects to the same subset<br>

>  2. Assign dissimilar objects to different subsets<br>

> that is, we are trying to make disjoint subsets.<br>

<br>

</div>I think so.  While there are clustering techniques where an item can be<br>

in more than one cluster, or in which items have a degree of belonging<br>

to each cluster, I'd say a strict partition into disjoint subsets is<br>

likely to be most useful for presenting search results in easier to<br>

comprehend ways.<br>

<div class=""><br>

> > It's an open question whether the project should be based on the<br>

> > existing code or not, but I think it should at least attempt to learn<br>

> > from the existing code - it would be a real shame to spend 3 months<br>

> > working on this only to end up with two different clustering<br>

> > implementations, neither of which is usable on larger sets of documents.<br>

> ><br>

> > I think the clustering would probably be based on the terms in the<br>

> > documents (I can't really think what else it would be based on).<br>

> > Possibly using Xapian's query expansion feature (Enquire::get_eset()) to<br>

> > generate a more restricted list of "interesting" terms to consider would<br>

> > help.<br>

><br>

> Yes, what I know is that clustering will be based on number of clusters<br>

> i.e. no. of disjoints sets for that particular document.<br>

<br>

</div>I'm not sure I follow what you're saying here.  Isn't each particular<br>

document going to be in exactly one set?<br>

<div class=""><br>

> Enquire:::get_eset() will return the expand set of related documents. The<br>

> corresponding "Xapian::ExpandDecider * edecider " will decide which<br>

> document has to be inserted in the expand set.<br>

<br>

</div>No, get_eset() returns a set of *TERMS*, not documents.  It works<br>

based on a set of documents marked as relevant (an RSet).<br>

<br>

And an ExpandDecider decides which *TERMS* to allow into the expand set.<br>

<br>

What I was suggesting is that you could potentially generate an ESet and<br>

use that as a restricted list of terms to consider.  But generating an<br>

ESet does takes a little time in itself.  Perhaps it would be better to<br>

just pick the terms with the highest "discriminating power" (really<br>

common terms aren't interesting, but neither are the really rare ones -<br>

at the extremes, a term is no help to clustering if it is in all<br>

documents or if it is in only one).<br>

<div class=""><br>

> So, here indirectly you are talking about is the vector space model where<br>

> we will measure the relatedness between the sets on the basis of their<br>

> Euclidean distances. Thus, using k-means algorithm ?<br>

<br>

</div>It was just an example which I hoped would help you visualise what I was<br>

talking about.  I'm not trying to say the clustering should be done<br>

based on a vector space model.<br>

<br>

Using k-means clustering is an option, though it has the drawback that<br>

you need to decide how many clusters you want first, but there might be<br>

a different number of natural clusters in the data.  It would be more<br>

satisfactory to be able to say you want (say) "about 7 clusters", or<br>

"between 6 and 10 clusters".  Running the k-means clustering multiple<br>

times to try different numbers of clusters probably isn't feasible for<br>

our applications.<br>

<div class=""><br>

> Can you please tell me how should I proceed?  What should I do to<br>

> start with the project ?<br>

<br>

</div>If you've not already done so, the first step is really to check out the<br>

Xapian code and get it to build.<br>

<br>

If/once you've done that, I'd suggest looking through the existing<br>

clustering code and working out what it is doing.  If you're familiar<br>

with profiling tools, then you could even try profiling it to see<br>

where the time is spent and what might be causing it to scale poorly.<br>

<div class=""><br>

> PS: Is the project-related to speed up the existing code, that is, to make<br>

> it work faster or something else,<br>

<br>

</div>The important goal is to have a clustering feature which works at a<br>

usable speed.<br>

<br>

That could mean taking the existing clustering code and making it work<br>

faster, or it could mean replacing it with new code.  If we start<br>

again, I think it would be prudent to at least try to understand why<br>

the current code doesn't scale as we'd like.  If we don't have some<br>

idea why the previous attempt didn't work out, we risk repeating the<br>

same mistakes.<br>

<div class=""><br>

> along with some good merging algorithm to merge the results.<br>

<br>

</div>I'm confused - merging what results?<br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br></div>