KMeans - Evaluation Results

James Aylett james-xapian at
Fri Aug 19 09:58:28 BST 2016

On 18 Aug 2016, at 23:59, Richhiey Thomas <richhiey.thomas at> wrote:

> I've currently added a few classes which don't really belong to the public API (currently) into private headers and used PIMPL with the Cluster class.

I'm having difficulty reading your changes, because you aren't keeping to one complete change per commit. So for instance you've added a new Cluster::Internal class in one commit, but started using it in a different commit in which you also make a number of other changes. This makes it much harder to see whether there are problems in what you've done. (Similarly, you add new similarity classes in one commit, and in the jumbo commit remove them from where they came from and start using the new ones, possibly.)

You also still haven't integrated Clusterer fully; RoundRobin doesn't inherit from it. There may be other similar issues left unaddressed, but I don't have time right now to work through the jumbo commit and match it against the work in the other commits.

> The PR failure is because of the old tests which I had written for testing the old API. I'll have to write completely new tests because the API has changed dramatically after mid terms. I'll get that fixed soon by writing a few test cases for the newly implemented functionality.

It's also because won't compile under clang because of an equality / assignment error, and under gcc because there are warnings about members shadowed by local variables. You should have got one or the other while compiling yourself; what version or which compiler are you using?

> Currently, the main classes which have data to hide are Cluster, ClusterSet and Clusterer subclass (currently KMeans). Thus, if we can use PIMPL with these classes, it could hide quite a lot of the non-public data. As a plus point, I'm having problems with shifting PointType, Point and Centroid classes to a private header because of forward referencing problems, so even these problems can be solved if KMeans is to go with PIMPL.

As I think I've said before, you should concentrate on getting the PR building and in good shape before going further with PIMPL or similar. An API that has extra bits but works and is tested is much better than a tidier API that isn't feature complete.

You have actually tidied up the KMeans API considerably just by moving some internal methods out of public visibility. Although in future we may be able to hide them completely in a private implementation, it's neater now and more obvious how to use the class.

> According to me, it doesn't even affect the interface that the Clusterer provides because its just an interface. We can still hook up newer algorithms, which may not want to use PIMPL later. (I maybe wrong here)

Because Clusterer doesn't use PIMPL, users are free to subclass it freely. (And use PIMPL or not.) Olly's concern on IRC was that using PIMPL for KMeans would prevent subclassing that. At the moment that's probably moot, because it's not currently structured to make subclassing it practical anyway.

Even once you have a green PR that's ready to merge and are looking for further things to work on, I wouldn't worry too much about removing PointType, Point & Centroid from the public API. I have a feeling that when we look at further ways of initialising the KMeans centroids we'll want to make that a public API rather than a selector (mode="random", mode="kmeanspp") as it is now. That will involve exposing some part of that set of classes to make that API. (There are a number of ways of initialising them in the literature already, and it would be nice to allow users to experiment with their own.)


 James Aylett — —

More information about the Xapian-devel mailing list