<div dir="ltr"><div class="gmail_default" style="color:rgb(11,83,148)">Some notes on the evaluation of clustering.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">It might be difficult to get hold of an ad hoc retrieval dataset with clustering metadata, so we should test it on a clustering dataset instead.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">A few popular ones are publicly available:<br>- 20 Newsgroups: <a href="http://qwone.com/~jason/20Newsgroups/">http://qwone.com/~jason/20Newsgroups/</a><br></div><div class="gmail_default" style="color:rgb(11,83,148)">- Reuters-21578: <a href="https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection">https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection</a><br></div><div class="gmail_default" style="color:rgb(11,83,148)"><br></div><div class="gmail_default" style="color:rgb(11,83,148)">The evaluation framework could be to use each document as a query, retrieve a set of documents from the corpus, cluster them, and measure purity, NMI (normalised mutual information) and the Rand index of the resulting clusters against the dataset's category labels.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">P.<br></div><div class="gmail_default" style="color:rgb(11,83,148)"><br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 19, 2016 at 2:28 PM, James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 18 Aug 2016, at 23:59, Richhiey Thomas <<a href="mailto:richhiey.thomas@gmail.com">richhiey.thomas@gmail.com</a>> wrote:<br>

<br>

> I've moved a few classes which don't currently belong in the public API into private headers, and used PIMPL with the Cluster class.<br>

<br>

</span>I'm having difficulty reading your changes, because you aren't keeping to one complete change per commit. So for instance you've added a new Cluster::Internal class in one commit, but started using it in a different commit in which you also make a number of other changes. This makes it much harder to see whether there are problems in what you've done. (Similarly, you add new similarity classes in one commit, and in the jumbo commit remove them from where they came from and start using the new ones, possibly.)<br>

<br>

You also still haven't integrated Clusterer fully; RoundRobin doesn't inherit from it. There may be other similar issues left unaddressed, but I don't have time right now to work through the jumbo commit and match it against the work in the other commits.<br>

<span class=""><br>

> The PR is failing because of the old tests, which I wrote against the old API. I'll have to write completely new tests because the API has changed dramatically since the mid-terms. I'll get that fixed soon by writing a few test cases for the newly implemented functionality.<br>

<br>

</span>It's also because cluster.cc won't compile under clang because of an equality / assignment error, and under gcc because there are warnings about members shadowed by local variables. You should have got one or the other while compiling yourself; what version or which compiler are you using?<br>

<span class=""><br>

> Currently, the main classes which have data to hide are Cluster, ClusterSet and Clusterer subclass (currently KMeans). Thus, if we can use PIMPL with these classes, it could hide quite a lot of the non-public data. As a plus point, I'm having problems with shifting PointType, Point and Centroid classes to a private header because of forward referencing problems, so even these problems can be solved if KMeans is to go with PIMPL.<br>

<br>

</span>As I think I've said before, you should concentrate on getting the PR building and in good shape before going further with PIMPL or similar. An API that has extra bits but works and is tested is much better than a tidier API that isn't feature complete.<br>

<br>

You have actually tidied up the KMeans API considerably just by moving some internal methods out of public visibility. Although in future we may be able to hide them completely in a private implementation, it's neater now and more obvious how to use the class.<br>

<span class=""><br>

> As I see it, it doesn't even affect the interface that Clusterer provides, because it's just an interface. We can still hook up newer algorithms later, which may not want to use PIMPL. (I may be wrong here.)<br>

<br>

</span>Because Clusterer doesn't use PIMPL, users are free to subclass it (and to use PIMPL or not in their subclass). Olly's concern on IRC was that using PIMPL for KMeans would prevent subclassing KMeans. At the moment that's probably moot, because it's not currently structured to make subclassing it practical anyway.<br>
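
<br>

To illustrate the trade-off, here is a minimal sketch, not the actual Xapian code. All class names (ClustererSketch, KMeansSketch, SingleClusterSketch) are hypothetical, the "clustering" is a placeholder rather than k-means, and std::unique_ptr is assumed purely for brevity. The abstract interface has no hidden state, so anyone can subclass it; the PIMPL subclass keeps its data in a type that is only complete in the .cc file, which is why a further subclass could not reach that data.<br>

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for an abstract interface: no PIMPL, freely subclassable.
class ClustererSketch {
  public:
    virtual ~ClustererSketch() = default;
    // Assign each of n items to a cluster id.
    virtual std::vector<int> cluster(int n) const = 0;
};

// A subclass that hides its state behind a pointer to an incomplete type.
class KMeansSketch : public ClustererSketch {
    class Internal;                       // only declared in the public header
    std::unique_ptr<Internal> internal;   // assumption: unique ownership
  public:
    explicit KMeansSketch(int k);
    ~KMeansSketch() override;             // defined where Internal is complete
    std::vector<int> cluster(int n) const override;
};

// --- everything below would live in the .cc file ---
class KMeansSketch::Internal {
  public:
    int k;
    explicit Internal(int k_) : k(k_) {}  // k_ avoids shadowing the member
};

KMeansSketch::KMeansSketch(int k) : internal(new Internal(k)) {}
KMeansSketch::~KMeansSketch() = default;

std::vector<int> KMeansSketch::cluster(int n) const {
    // Placeholder assignment (NOT k-means): item i goes to cluster i mod k.
    std::vector<int> out;
    for (int i = 0; i < n; ++i) out.push_back(i % internal->k);
    return out;
}

// A user-written subclass of the interface that uses no PIMPL at all.
class SingleClusterSketch : public ClustererSketch {
  public:
    std::vector<int> cluster(int n) const override {
        return std::vector<int>(n > 0 ? n : 0, 0);  // everything in cluster 0
    }
};
```

A class derived from KMeansSketch outside the .cc file would see only the public methods, since Internal is an incomplete type there.<br>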

<br>

Even once you have a green PR that's ready to merge and are looking for further things to work on, I wouldn't worry too much about removing PointType, Point & Centroid from the public API. I have a feeling that when we look at further ways of initialising the KMeans centroids we'll want to make that a public API rather than a selector (mode="random", mode="kmeanspp") as it is now. That will involve exposing some part of that set of classes to make that API. (There are a number of ways of initialising them in the literature already, and it would be nice to allow users to experiment with their own.)<br>
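
<br>

As a rough sketch of what an initialisation API (as opposed to a mode selector) might look like: the names below (PointSketch, CentroidInitialiserSketch, EvenlySpacedInitialiser, initial_centroids) are hypothetical and not part of Xapian, and points are assumed to be plain vectors of doubles. The idea is only that the clusterer takes an object implementing the strategy, so users can plug in their own.<br>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical point type for illustration only.
typedef std::vector<double> PointSketch;

// Strategy interface: given the points, pick k initial centroids.
class CentroidInitialiserSketch {
  public:
    virtual ~CentroidInitialiserSketch() = default;
    virtual std::vector<PointSketch>
    choose(const std::vector<PointSketch>& points, std::size_t k) const = 0;
};

// One deterministic strategy: take k evenly spaced points from the input.
// A random or k-means++ strategy would be another subclass.
class EvenlySpacedInitialiser : public CentroidInitialiserSketch {
  public:
    std::vector<PointSketch>
    choose(const std::vector<PointSketch>& points, std::size_t k) const override {
        std::vector<PointSketch> centroids;
        if (k > points.size()) k = points.size();  // cannot pick more than we have
        for (std::size_t i = 0; i < k; ++i) {
            centroids.push_back(points[i * points.size() / k]);
        }
        return centroids;
    }
};

// The clusterer would accept any initialiser instead of mode="random" etc.
std::vector<PointSketch> initial_centroids(const std::vector<PointSketch>& points,
                                           std::size_t k,
                                           const CentroidInitialiserSketch& init) {
    return init.choose(points, k);
}
```

Exposing something like this would require the point and centroid types to stay public, which is the reason not to rush to hide them.<br>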

<span class="HOEnZb"><font color="#888888"><br>

J<br>

<br>

--<br>

 James Aylett<br>

 <a href="http://devfort.com" rel="noreferrer" target="_blank">devfort.com</a> — <a href="http://spacelog.org" rel="noreferrer" target="_blank">spacelog.org</a> — <a href="http://tartarus.org/james/" rel="noreferrer" target="_blank">tartarus.org/james/</a><br>

<br>

<br>

</font></span></blockquote></div><br></div>
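<div>For reference, the three measures mentioned at the top of this message (purity, NMI and the Rand index) can be sketched as below. This is only an illustration under some assumptions: cluster ids and category labels are given as two equal-length integer vectors, the helper names (Contingency, tabulate, purity, rand_index, nmi) are made up here, and NMI is normalised by the arithmetic mean of the two entropies, which is one of several conventions in use.<br></div>

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Counts of items per (cluster, label) pair, per cluster and per label.
struct Contingency {
    std::map<std::pair<int, int>, double> joint;
    std::map<int, double> cluster_size;
    std::map<int, double> label_size;
    double n = 0;
};

// Assumes clusters and labels have the same length.
Contingency tabulate(const std::vector<int>& clusters, const std::vector<int>& labels) {
    Contingency c;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        c.joint[std::make_pair(clusters[i], labels[i])] += 1;
        c.cluster_size[clusters[i]] += 1;
        c.label_size[labels[i]] += 1;
        c.n += 1;
    }
    return c;
}

// Purity: fraction of items that carry the majority label of their cluster.
double purity(const std::vector<int>& clusters, const std::vector<int>& labels) {
    Contingency c = tabulate(clusters, labels);
    std::map<int, double> best;  // largest single-label count per cluster
    for (const auto& kv : c.joint) {
        double& b = best[kv.first.first];
        if (kv.second > b) b = kv.second;
    }
    double sum = 0;
    for (const auto& kv : best) sum += kv.second;
    return c.n > 0 ? sum / c.n : 0.0;
}

static double pairs(double k) { return k * (k - 1) / 2; }

// Rand index: fraction of item pairs on which clustering and labels agree.
double rand_index(const std::vector<int>& clusters, const std::vector<int>& labels) {
    Contingency c = tabulate(clusters, labels);
    double same_both = 0, same_cluster = 0, same_label = 0;
    for (const auto& kv : c.joint) same_both += pairs(kv.second);
    for (const auto& kv : c.cluster_size) same_cluster += pairs(kv.second);
    for (const auto& kv : c.label_size) same_label += pairs(kv.second);
    double total = pairs(c.n);
    if (total == 0) return 1.0;  // convention chosen here for fewer than two items
    return (total - same_cluster - same_label + 2 * same_both) / total;
}

static double entropy(const std::map<int, double>& sizes, double n) {
    double h = 0;
    for (const auto& kv : sizes) {
        double p = kv.second / n;
        h -= p * std::log(p);
    }
    return h;
}

// NMI: mutual information divided by the mean of the two entropies.
double nmi(const std::vector<int>& clusters, const std::vector<int>& labels) {
    Contingency c = tabulate(clusters, labels);
    if (c.n == 0) return 0.0;
    double mi = 0;
    for (const auto& kv : c.joint) {
        double a = c.cluster_size[kv.first.first];
        double b = c.label_size[kv.first.second];
        mi += (kv.second / c.n) * std::log(c.n * kv.second / (a * b));
    }
    double denom = (entropy(c.cluster_size, c.n) + entropy(c.label_size, c.n)) / 2;
    return denom > 0 ? mi / denom : 1.0;  // convention: one cluster, one label
}
```

<div>For each query document, the cluster ids would come from the clusterer's output on the retrieved set and the labels from the dataset's categories; the scores could then be averaged over queries.<br></div>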