[Xapian-devel] GSoC 2014: Clustering of Search Results

Sat Mar 22 01:59:46 GMT 2014

On Fri, Mar 21, 2014 at 01:37:31AM +0530, Abhishek Gupta wrote:
> 1) Hierarchical clustering have high memory requirements *O(n*n)* in
> comparison to *O(n+K)* space complexity of K-means algorithm, where*n* is
> the number of elements and* K *is the number of clusters.
> 2) Hierarchical clustering running time is *O(n*n*n)* on the other hand
> K-means algorithm has time complexity of *O(n)*.

Are you sure about those complexities?

(Hint: at least one is wrong...)

Also, O() is the asymptotic behaviour, but what we really care about is
it taking well under a second to cluster a limited number of results.
Probably 10000 at most, and I suspect more like 1000.  Bear in mind
that:

    K2 * n * n < K1 * n  for small n

What is "small" depends on K1 and K2 of course.

> I read your existing source code for the clustering which is quite slow
> because of the hierarchical based clustering which is not required at all.*You
> have already provided with the number of clusters you should have in the
> end*. So for this we can employ K-means algorithm which can perform far
> better than the current algorithm.

I'd say it is a flaw in the current API that you are expected to provide
the number of clusters you want in advance.  How can you know how many
natural clusters there will be?

The current algorithm is at least easily amenable to producing a
variable number of clusters (just make the condition to stop linking
related to the length the next link would need to be).

> One thing that K-means lacks is its non-deterministic outcome. Every time
> it will produce different clusters. But we can always run the algorithm
> 10-12 times and then take the average even then it will perform far better
> than the hierarchical one.

Having to run the algorithm repeatedly and then compare the outcomes is
going to rather work against fast clustering.  And what do you mean by
the "average" of 12 different partitions?

Non-deterministic isn't ideal, but is probably tolerable.  But somehow
picking the best of 12 non-deterministic results isn't magically going
to be deterministic anyway.

I'm not saying K-means isn't an option, but the case you've made for
it so far isn't particularly convincing.

Cheers,
    Olly