K MEANS clustering

Wed Jul 27 03:28:51 BST 2016

Hey Parth,

Thanks for the reply.
I am considering implementing a cosine distance metric because of the
dimensionality issue that comes with K-Means and the Euclidean distance
metric.
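
A rough sketch of the cosine distance I have in mind, computed directly on
sparse term -> weight maps. The TermVector alias and the function names are
placeholders made up for illustration, not existing Xapian API, and the
weights are assumed to be plain term frequencies:

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical type: term -> weight (e.g. term frequency) for one document.
using TermVector = std::unordered_map<std::string, double>;

// Dot product over the terms the two vectors share. Iterating over the
// smaller map keeps the cost proportional to the shorter document.
double dot(const TermVector& a, const TermVector& b) {
    const TermVector& shorter = a.size() <= b.size() ? a : b;
    const TermVector& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (const auto& entry : shorter) {
        auto it = longer.find(entry.first);
        if (it != longer.end()) sum += entry.second * it->second;
    }
    return sum;
}

// Euclidean length of a sparse vector.
double norm(const TermVector& v) {
    double sum = 0.0;
    for (const auto& entry : v) sum += entry.second * entry.second;
    return std::sqrt(sum);
}

// Cosine distance = 1 - cosine similarity.
// Returns 1.0 (maximally distant) if either vector has zero length.
double cosine_distance(const TermVector& a, const TermVector& b) {
    const double na = norm(a);
    const double nb = norm(b);
    if (na == 0.0 || nb == 0.0) return 1.0;
    return 1.0 - dot(a, b) / (na * nb);
}
```

Only the terms the two vectors have in common contribute to the dot product,
so nothing dense has to be built just to compare two documents.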

Currently, I find the distance between two documents by iterating over
their terms and looking up the term frequencies, which I've stored in a
map, so I haven't stored a separate vector for every document. In K-Means,
though, the mean of a cluster need not correspond to any document vector,
and the centroids will be dense, so representing them is becoming a
problem. Should I use a map for the centroids too, storing every term that
occurs in the cluster along with its average value?
Or would it be a better approach to store a document vector for every
document?

Thanks.