K MEANS clustering

Tue Jul 26 05:48:40 BST 2016

Hello,

I've been working on the KMeans clustering algorithm recently, and for the
past week I have been stuck on a problem that I can't find a solution
to.

Since we represent documents as TF-IDF vectors, they are very sparse (a
typical corpus can have around 5000 terms), so it is difficult to
represent them in a way that makes computing Euclidean distances
efficient. I implemented a K-Medoids algorithm using PAM just to try it
out, after extending the API as required, and that seems fine, since
there we are dealing only with document vectors and not arbitrary
vectors. With KMeans, however, I can't figure out how to represent the
centroids in each iteration, when the average of a cluster has to be
computed.

So my question is: how can I represent an arbitrary sparse vector to be
used as the centroid in KMeans?
Can anyone please guide me on this?
Would using Boost C++ be a solution?
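
To make the question concrete, here is a minimal sketch of one possible
representation, using only the standard library. Everything in it is an
assumption for illustration and not existing Xapian API: the SparseVec
type (a hash map from term id to weight that stores only non-zero
entries) and the helper names dot, squared_norm, centroid and
squared_distance are all hypothetical. The centroid is just the
entry-wise sum of the cluster's vectors divided by the cluster size, and
the distance uses the identity |x - c|^2 = |x|^2 + |c|^2 - 2 x.c so that
only non-zero entries are ever visited.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical type: term id -> weight, storing only non-zero entries.
typedef std::unordered_map<unsigned, double> SparseVec;

// Dot product: iterate over the shorter vector, look up in the longer one.
double dot(const SparseVec& a, const SparseVec& b) {
    const SparseVec& shorter = a.size() <= b.size() ? a : b;
    const SparseVec& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (SparseVec::const_iterator it = shorter.begin();
         it != shorter.end(); ++it) {
        SparseVec::const_iterator hit = longer.find(it->first);
        if (hit != longer.end()) sum += it->second * hit->second;
    }
    return sum;
}

// Squared Euclidean norm of a sparse vector.
double squared_norm(const SparseVec& v) {
    double sum = 0.0;
    for (SparseVec::const_iterator it = v.begin(); it != v.end(); ++it)
        sum += it->second * it->second;
    return sum;
}

// Centroid of a cluster: entry-wise sum divided by the number of members.
// The result holds the union of the members' terms, so it is usually
// denser than any single document but still sparse.
SparseVec centroid(const std::vector<SparseVec>& cluster) {
    assert(!cluster.empty());
    SparseVec mean;
    for (std::size_t i = 0; i < cluster.size(); ++i)
        for (SparseVec::const_iterator it = cluster[i].begin();
             it != cluster[i].end(); ++it)
            mean[it->first] += it->second;
    const double n = static_cast<double>(cluster.size());
    for (SparseVec::iterator it = mean.begin(); it != mean.end(); ++it)
        it->second /= n;
    return mean;
}

// Squared Euclidean distance via |x|^2 + |c|^2 - 2 x.c, clamped at zero
// to guard against small negative values from rounding.
double squared_distance(const SparseVec& x, const SparseVec& c) {
    const double d = squared_norm(x) + squared_norm(c) - 2.0 * dot(x, c);
    return d > 0.0 ? d : 0.0;
}
```

In a sketch like this, the squared norm of each document could be
computed once up front and the squared norm of each centroid once per
iteration, leaving only the dot product to be evaluated per
document/centroid pair. Whether this is fast enough in practice is
something I have not measured.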

Thanks