KMeans - Evaluation Results

James Aylett james-xapian at tartarus.org
Tue Aug 16 15:19:37 BST 2016


On 15 Aug 2016, at 22:05, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> I have tested this implementation of KMeans with a BBC news article dataset.

You mentioned FIRE as another dataset to test with, but I don't think it has any pre-clustered datasets. There may be something helpful in another dataset. However if the BBC dataset can be fairly trivially clustered manually (which you suggested it could) then that could be used to compute some external evaluation measures. I don't know much about evaluating clustering, but maybe Fowlkes-Mallows would be a good starting point?

There are also some internal evaluations that might be worth looking at, like the silhouette coefficient.

In all cases, you need to compare against something for the numbers to mean much. It's probably worth setting up a Google spreadsheet or similar to track these: then you can build up different coefficients calculated for KMeans, KM++ seeded KMeans, and round robin clustering to see how they behave.

(This becomes important in future when looking at clustering or pre-clustering approaches that have threshold parameters, so we can evaluate performance as we tweak the defaults.)

> When dealing with an MSet that contains around 200 - 300 documents, clustering is efficient and usually gives sub-optimal clustering solutions (average distance between document and centroid = 0.65 - 0.75 in cosine similarity). For MSet's containing more documents, the time seems to increase steadily.

How long does 200–300 documents take to cluster? How does it grow as more documents are included in the MSet? We'd expect an MSet of 1000 documents to take longer to cluster than one with 100, but the important thing is _how_ the time increases as the number of documents grows.

> KMeans++ results:
> KMeans++ provides slightly superior results due to spreading the cluster centroids out. But this becomes a problem when the document vectors are not spread uniformly and concentrate in a few places, because the cluster sizes end up being out of proportion.

Surely that's the right behaviour for that kind of data? (Although AIUI KMeans isn't supposed to be that good in that situation: is that what you mean?)

> With respect to optimization, I'm currently using unordered_map where ever I'm requiring to map values. It certainly works faster than the std::map but there are various hashmap implementations that can work faster. Google dense hash map is one of those and they work way faster than unordered_map (to my knowledge). I thought of using them to speed value look ups, but would it affect the portability of the code? Or will it be possible for me to use it?

There are a number of considerations in using other code or libraries:

1. What's the license? If it's incompatible with GPL, then you can't use it, period. Also, if it's code that we want to copy into the source tree (rather than a library), under a license that is compatible with the GPL but can't be relicensed as MIT/X license, that's also a problem because of the licensing direction we want to go in (https://github.com/xapian/xapian/blob/master/xapian-core/HACKING#L1427).

2. If the idea is to copy code in, then we need to ensure that we respect copyright notices as well as the license. We also generally don't want to copy code in that is under active development, or we'll constantly be updating our copy of it, which is both inefficient and inelegant.

3. If it's a library, then it needs to be easy to install on a wide range of systems. If not, it may be worth considering whether we can optionally use it, so if it's not installed we fall back to a slower version backed by the STL.

4. In all cases, there's a possible portability concern. Some projects will give a good idea of their portability, either by stating their aims in that respect, or with some sort of test matrix.

Before you dive in, I'd also ask yourself explicitly what problem you are trying to solve. There may be other ways of thinking about it that lead you to different data structures or algorithms for working with them. It's not always best to continually look for more efficient implementations of the same data structure.

For instance, even with sparse vectors, it may be that compact storage is more CPU efficient without taking too much memory. Maybe there's some dimensionality reduction work that can be done before or while creating the document vectors: eliminating non-discriminative terms, or preprocessing the database to find terms that are close to linearly dependent, for instance. There is a wealth of options to look at here; I know that a previous GSoC project on clustering looked at dimensionality reduction when optimising the previous clustering branch.

> I would like to know what documentation I would need to write for the current clustering API so that I can start with that too,

The classes need documenting in their header files, which you're partway through already: at least every non-private member should be fully documented, and every class given an overall documentation comment that explains its usage.

Anything which isn't part of the public API should not be in the external header file, but still needs documenting so those working on the library in future know what they're doing.

Then we need an introduction to using the clustering system in the user guide (https://getting-started-with-xapian.readthedocs.io/en/latest/). That should be a new document in the "how to" part of the guide.

> Also, since the final week for GSoC is starting now, should I list the API that was merged into a different branch before as merged or as yet to merge? And incase I am not able to complete PSO by 23rd august since I am behind the timeline, would it be possible for me to continue it later on?

You should point to the work that was merged, and additionally to any pull requests on top of that. As we merge those pull requests, it will still be obvious to anyone used to GitHub what's going on.

I think in practice you should concentrate on everything other than PSO: finishing the optimisations, writing the documentation, getting the PR in the best possible shape ready for merge, and writing up the evaluations. You can of course continue to work on PSO (or anything else!) after GSoC is ended. We're always happy when former GSoC students stay a part of Xapian, and it's one of our goals of doing it :-)

J



