<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <span dir="ltr">&lt;<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Exactly what approach the project takes isn't nailed down - it just<br>

seemed something which would be interesting for a student to work on,<br>

and would be useful to Xapian users.<br>

<br>

My understanding of the current clustering branch (which may not be<br>

completely accurate) is that it clusters based on a pair-wise measure<br>

of document similarity, and that the user can specify which terms from<br>

the documents are used.  I think you'd consider more than just the words<br>

in the query - in a typical case, the query is short and the top N<br>

documents will match all the words in it.<br></blockquote><div><br></div><div>So, what you are saying is that we need to:<br> 1. Assign similar objects to the same subset<br> 2. Assign dissimilar objects to different subsets<br>

</div><div>That is, we are trying to partition the documents into disjoint subsets.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

It's an open question whether the project should be based on the<br>

existing code or not, but I think it should at least attempt to learn<br>

from the existing code - it would be a real shame to spend 3 months<br>

working on this only to end up with two different clustering<br>

implementations, neither of which is usable on larger sets of documents.<br>

<br>

I think the clustering would probably be based on the terms in the<br>

documents (I can't really think what else it would be based on).<br>

Possibly using Xapian's query expansion feature (Enquire::get_eset()) to<br>

generate a more restricted list of "interesting" terms to consider would<br>

help.<br></blockquote><div><br></div><div>Yes, as far as I know, clustering will be based on the number of clusters, i.e. the number of disjoint sets for that particular set of documents.<br><br></div><div>Enquire::get_eset() will return the expand set of terms drawn from the relevant documents. The corresponding "Xapian::ExpandDecider * edecider" will decide which terms are to be included in the expand set.<br>

</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> 

<br></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="">

</div>That's related to clustering, but it isn't completely equivalent.<br>

<br>

As an example, one way you could generate clusters is to think of each<br>

document as a point in a multi-dimensional space, where each dimension<br>

represents a different term with the distance in that direction being<br>

something like (within_document_frequency / document_length).  In this<br>

space, the distance between two identical documents is 0, and documents<br>

which are more different will tend to be further apart (one word<br>

changed is a small distance; no words in common is a long way apart).<br>

<br>

Clustering is then splitting the documents into groups which are near<br>

each other in that space.<br></blockquote><div><br></div><div>So, what you are indirectly describing here is the vector space model, where we measure the relatedness between documents on the basis of their Euclidean distances. Would that mean using the k-means algorithm?<br>
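<br>To check my understanding of the space you describe, here is a minimal Python sketch, not Xapian code and not taken from the existing clustering branch. Each document becomes a sparse vector whose coordinate for a term is within_document_frequency / document_length, and the distance between two documents is the Euclidean distance between their vectors. The helper names (term_vector, distance) and the toy documents are my own assumptions, and Euclidean distance is only one possible measure.<br>

```python
import math


def term_vector(term_counts):
    """Map each term to within_document_frequency / document_length.

    term_counts is a hypothetical term -> count mapping for one document.
    """
    doc_length = sum(term_counts.values())
    if doc_length == 0:
        return {}
    return {term: wdf / doc_length for term, wdf in term_counts.items()}


def distance(vec_a, vec_b):
    """Euclidean distance between two sparse term vectors."""
    terms = set(vec_a) | set(vec_b)
    return math.sqrt(
        sum((vec_a.get(t, 0.0) - vec_b.get(t, 0.0)) ** 2 for t in terms)
    )


# Toy documents (made up for illustration only).
doc1 = term_vector({"xapian": 2, "cluster": 1, "search": 1})
doc2 = term_vector({"xapian": 2, "cluster": 1, "index": 1})  # one word changed
doc3 = term_vector({"cave": 2, "survey": 2})                 # no words in common

print(distance(doc1, doc1))  # identical documents
print(distance(doc1, doc2))  # small change
print(distance(doc1, doc3))  # nothing in common
```

With these toy inputs, identical documents are at distance 0, and the document with one word changed is closer than the one sharing no words, which matches the behaviour you describe.<br>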

<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Set expansion would mean picking some seed documents to start the sets,<br>

and then going through the remaining documents adding them to the<br>

"nearest" set (by some measure).  These sets are really just the same<br>

as clusters (at least if each document belongs to exactly one cluster)<br>

so this is a way to get you a fixed number of clusters, but this is not<br>

the only way to generate a fixed number of clusters, and not all<br>

clustering starts out looking for a fixed number of clusters.<br></blockquote><div><br></div><div>Could you please tell me how I should proceed?<br></div><div>What should I do to get started on the project?<br><br></div><div>
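For reference, this is how I picture the set expansion you describe, as a rough Python sketch under my own assumptions: documents are already sparse term vectors, the seeds are chosen by hand, and "nearest" means the smallest Euclidean distance to a seed document (rather than to the whole set). The names expand_sets and distance and the toy data are hypothetical, not part of Xapian.<br>

```python
import math


def distance(vec_a, vec_b):
    """Euclidean distance between two sparse term vectors."""
    terms = set(vec_a) | set(vec_b)
    return math.sqrt(
        sum((vec_a.get(t, 0.0) - vec_b.get(t, 0.0)) ** 2 for t in terms)
    )


def expand_sets(vectors, seed_ids):
    """Start one set per seed, then add every other document to the
    set whose seed is nearest. Each document ends up in exactly one set."""
    clusters = {seed: [seed] for seed in seed_ids}
    for doc_id, vec in vectors.items():
        if doc_id in clusters:
            continue  # seeds already belong to their own set
        nearest = min(seed_ids, key=lambda s: distance(vec, vectors[s]))
        clusters[nearest].append(doc_id)
    return clusters


# Toy term vectors (made up for illustration only).
vectors = {
    "a": {"xapian": 0.5, "search": 0.5},
    "b": {"xapian": 0.5, "index": 0.5},
    "c": {"cave": 0.5, "survey": 0.5},
    "d": {"cave": 0.5, "map": 0.5},
}
clusters = expand_sets(vectors, ["a", "c"])
print(clusters)
```

The number of seeds fixes the number of clusters, which is the property you point out above.<br><br>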

Thanks,<br>Saksham<br> <br></div><div>PS: Is the project about speeding up the existing code, that is, making it work faster, or about something else, along with a good merging algorithm to merge the results?<br></div></div>

</div></div>