K MEANS clustering

Parth Gupta pargup8 at gmail.com
Thu Jul 28 07:45:51 BST 2016


Hi Richhiey

Storing the centroids as double arrays is a better choice because of their
dense nature and simplicity for operating over in parallel e.g. if you want
to pass it to BLAS subroutine. In my previous email, I tried to calculate
space requirements for it.

A document can be stored in the sparse format (like you do with map) but
before passing it to a cosine similarity subroutine you can make it dense
(create a new double array and set its particular non-zero indexes).
Alternatively, if you decide to operate in the sparse space, you can
efficiently access centroid entries at indexes for which document has
non-zero entry.

Basically, here you have to make a design choice. In the former case, you
can use BLAS like subroutines, while in the latter, you save computation.
The former is also valid for the euclidean distance metric while the latter
is not. Olly/James might have an opinion (may be from the previous cluster
branch).

Cheers
Parth

On Wed, Jul 27, 2016 at 6:17 PM, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:

> Hey Parth,
>
> Thanks for the reply.
> I am considering implementing a cosine distance metric too, along with
> euclidian distance because of the dimensionality issue that comes in with
> K-Means and euclidian distance metric.
> That does help when we deal with sparse vectors for documents. The
> particular problem I'm having is representing centroids in an efficient way.
> For example, when we find the mean vector of a cluster, the resultant
> centroid need not be a document vector of a document belonging to that
> cluster. Hence representing that cluster, which will be dense as a C++ map
> is inefficient because of the number of terms associated with it and
> calculating distances with that doesn't work or scale too well.
> Over that, my distance calculation works over two documents. So will I
> need to modify that in a way to accommodate arbitrary vectors which might
> not represent document vectors?
> Would be great if everyone could add there inputs on this.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160728/6800036f/attachment.html>


More information about the Xapian-devel mailing list