KMeans - Evaluation Results

Parth Gupta pargup8 at gmail.com
Mon Aug 29 07:11:01 BST 2016


Hi Richhiey

Some comments on the silhouette coefficient report. First, results from a
single query are not reliable; it is better to evaluate with more queries.
The setup I mentioned in my earlier email, using each document as a query,
is a good way to get a statistically significant number of results.

The silhouette coefficient is usually used to select the correct k; it tells
you how close (or how separable) the clusters are to each other. Purity and
Rand index are more quality-based metrics, which say how good the clusters
are. For example, the purity of a cluster is calculated as (max number of
elements of one type) / (total number of elements in the cluster). Note that
all the clusters can contain documents of the same category and purity will
still be 1. That is fine for now, because our MSet is like that and we need
to improve diversity. That is a different story.
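
To make the purity calculation concrete, here is a minimal Python sketch.
The function names and the toy labels are my own illustration and are not
part of Xapian; each cluster is assumed to be given as a list of the
category labels of its documents.

```python
from collections import Counter


def cluster_purity(labels):
    """Purity of one cluster: (count of most common label) / (cluster size)."""
    if not labels:
        raise ValueError("cluster is empty")
    return Counter(labels).most_common(1)[0][1] / len(labels)


def overall_purity(clusters):
    """Purity of a clustering: sum of per-cluster majority counts / total docs."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        raise ValueError("no documents")
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total


# Hypothetical labelled clusters (one label per document).
clusters = [
    ["sports", "sports", "sports", "politics"],
    ["politics", "politics", "tech"],
]
print(cluster_purity(clusters[0]))  # 0.75
print(overall_purity(clusters))     # 5/7, about 0.714

# The degenerate case from above: every cluster holds the same category,
# yet purity is still 1.
print(overall_purity([["sports", "sports"], ["sports"]]))  # 1.0
```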

Rand Index is a pair-based metric: it measures the fraction of document
pairs that are clustered correctly, so it can be read as a cluster accuracy.
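
As a rough sketch of the pair counting, assuming we have one cluster id and
one category label per document (the function name and the toy data are
made up for illustration): a pair counts as correct when the two documents
are in the same cluster and share a label, or are in different clusters and
have different labels.

```python
from itertools import combinations


def rand_index(cluster_ids, class_labels):
    """Fraction of document pairs on which clustering and labels agree."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("need one cluster id and one label per document")
    agree = 0
    total = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_class = class_labels[i] == class_labels[j]
        # Agreement covers both true positives and true negatives.
        if same_cluster == same_class:
            agree += 1
        total += 1
    if total == 0:
        raise ValueError("need at least two documents")
    return agree / total


# Clustering matches the labels exactly.
print(rand_index([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0

# Clustering cuts across the labels: only 2 of the 6 pairs agree.
print(rand_index([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

This loops over all pairs, so it is quadratic in the number of documents,
which should be fine for MSet-sized inputs.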

Both of them require labels which we should be able to get from the
datasets I mentioned earlier. These metrics are explained nicely with
examples here:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

Cheers
Parth



On Fri, Aug 26, 2016 at 2:29 PM, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:

> Hello,
>
> I have started with evaluation of Clusterers so I can improve on them and
> have used the silhouette coefficient for starting off.
>
> The results I have added in a google doc. Hope you check it out and let me
> know how I can improve and go ahead.
>
> https://docs.google.com/document/d/1vpG_iPH4rRIhNxeJ87MZy-yBHfJBlcIyf6M33PsDwBc/edit?usp=sharing
>
> Also, Parth, could you explain in a little more detail how external
> measures like purity and Rand index can be calculated with the unlabeled
> data that we have? I'm currently only looking at internal measures.
>
> Regards,
> Richhiey
>