K MEANS clustering

Richhiey Thomas richhiey.thomas at gmail.com
Wed Jul 27 13:47:04 BST 2016


Hey Parth,

Thanks for the reply.
I am considering implementing a cosine distance metric too, along with
Euclidean distance, because of the dimensionality issue that comes in with
K-Means and the Euclidean distance metric.
That does help when we deal with sparse vectors for documents. The
particular problem I'm having is representing centroids in an efficient way.
For example, when we find the mean vector of a cluster, the resultant
centroid need not be a document vector of a document belonging to that
cluster. Hence representing that centroid, which will be dense, as a C++ map
is inefficient because of the number of terms associated with it, and
calculating distances with it doesn't work or scale too well.
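
To make the centroid problem concrete, here is a minimal sketch of one
possible representation. It assumes every term has already been mapped to a
compact index in [0, vocab_size), which may not match how the term lists are
actually stored. The names TermIndex, SparseDoc, Centroid and
compute_centroid are hypothetical and are not Xapian API. Documents stay
sparse, while the centroid is a flat array with a cached squared norm.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; none of this is Xapian API.
using TermIndex = std::size_t;  // assumed compact index in [0, vocab_size)
using SparseDoc = std::vector<std::pair<TermIndex, double>>;  // (term, weight)

struct Centroid {
    std::vector<double> weights;  // dense: one slot per term in the vocabulary
    double sq_norm = 0.0;         // cached sum of squared weights
};

// Mean of the documents in one cluster, stored densely.
Centroid compute_centroid(const std::vector<SparseDoc>& docs,
                          std::size_t vocab_size) {
    assert(!docs.empty());
    Centroid c;
    c.weights.assign(vocab_size, 0.0);

    // Accumulate only the non-zero entries of each document.
    for (const SparseDoc& doc : docs) {
        for (const auto& entry : doc) {
            assert(entry.first < vocab_size);
            c.weights[entry.first] += entry.second;
        }
    }

    // Divide by the cluster size and cache the squared norm.
    const double n = static_cast<double>(docs.size());
    for (double& w : c.weights) {
        w /= n;
        c.sq_norm += w * w;
    }
    return c;
}
```

A flat array costs one double per vocabulary term per centroid, so this only
makes sense if the number of clusters is small compared to the number of
documents.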
On top of that, my distance calculation works over two documents. So will I need
to modify that in a way to accommodate arbitrary vectors which might not
represent document vectors?
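
One way the distance calculation could accept a non-document vector is
sketched below, under the same assumption that terms are mapped to compact
indices. DensePoint, euclidean_distance and cosine_distance are hypothetical
names, not existing Xapian functions. It uses the identity
|d - c|^2 = |d|^2 + |c|^2 - 2 d.c, so each distance only walks the non-zero
terms of the document.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; none of this is Xapian API.
using TermIndex = std::size_t;  // assumed compact index into the vocabulary
using SparseDoc = std::vector<std::pair<TermIndex, double>>;  // (term, weight)

struct DensePoint {
    std::vector<double> weights;  // e.g. a centroid, one slot per term
    double sq_norm;               // cached sum of squared weights
};

double squared_norm(const SparseDoc& doc) {
    double sum = 0.0;
    for (const auto& entry : doc) sum += entry.second * entry.second;
    return sum;
}

// Dot product that touches only the document's non-zero terms.
double dot(const SparseDoc& doc, const DensePoint& p) {
    double sum = 0.0;
    for (const auto& entry : doc) {
        assert(entry.first < p.weights.size());
        sum += entry.second * p.weights[entry.first];
    }
    return sum;
}

double euclidean_distance(const SparseDoc& doc, const DensePoint& p) {
    const double sq = squared_norm(doc) + p.sq_norm - 2.0 * dot(doc, p);
    return std::sqrt(std::max(sq, 0.0));  // clamp tiny negative rounding error
}

double cosine_distance(const SparseDoc& doc, const DensePoint& p) {
    const double denom = std::sqrt(squared_norm(doc) * p.sq_norm);
    if (denom == 0.0) return 1.0;  // convention assumed here for zero vectors
    return 1.0 - dot(doc, p) / denom;
}
```

With this shape, a document-to-document distance could stay as it is, and
only the document-to-centroid case would need the extra overload.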
It would be great if everyone could add their input on this.