KMeans - Evaluation Results

Mon Aug 15 22:05:52 BST 2016

Hello,

I've recently finished with an implementation of KMeans with two
initialization techniques, random initialization and KMeans++. I would like
to share my findings after evaluating the same.

I have tested this implementation of KMeans with a BBC news article
dataset. I am currently working on evaluating the same with FIRE datasets.
Currently, clustering more than 500 documents efficiently is taking time (a
little more than 5 seconds) and I'm trying to optimize for time by finding
methods for early convergence or simple code profiling.

The BBC news article dataset contains articles related to politics, sports,
tech, entertainment and business and contains 2225 documents in all.

Random KMeans results:
When dealing with an MSet that contains around 200 - 300 documents,
clustering is efficient and usually gives sub-optimal clustering solutions
(average distance between document and centroid = 0.65 - 0.75 in cosine
similarity). For MSet's containing more documents, the time seems to
increase steadily.

KMeans++ results:
KMeans++ provides slightly superior results due to spreading the cluster
centroids out. But this becomes a problem when the document vectors are not
spread uniformly and concentrate in a few places, because the cluster sizes
end up being out of proportion.
The average distance between document and centroid is lesser than that of
KMeans mostly (around 0.55-0.7). Thus, in many cases, the clusters end up
being more compact.

I'm continuing to optimize this solution in time before I can go ahead with
PSO.

With respect to optimization, I'm currently using unordered_map where ever
I'm requiring to map values. It certainly works faster than the std::map
but there are various hashmap implementations that can work faster. Google
dense hash map is one of those and they work way faster than unordered_map
(to my knowledge). I thought of using them to speed value look ups, but
would it affect the portability of the code? Or will it be possible for me
to use it?

I would like to know what documentation I would need to write for the
current clustering API so that I can start with that too,

Also, since the final week for GSoC is starting now, should I list the API
that was merged into a different branch before as merged or as yet to
merge? And incase I am not able to complete PSO by 23rd august since I am
behind the timeline, would it be possible for me to continue it later on?

Thanks.

Regards,
Richhiey
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160816/a94f3ed1/attachment.html>