KMeans - Going forward

James Aylett james at
Sun Jul 23 21:43:12 BST 2017

On 23 Jul 2017, at 20:50, Richhiey Thomas <richhiey.thomas at> wrote:

> Work on stopword removal and stemming is almost finished, and the run time for KMeans is coming down (around 0.15 s for 100 documents, rising to around 1.2 s for 500 documents and 2.5 s for 1000 documents). I tried this out on the BBC datasets with k=5, since there are 5 categories in the dataset.
> Going forward, the next step in optimizing KMeans is to use the faster version developed by Charles Elkan, which reduces the number of distance computations. For this, I will give the user an option in the constructor to choose between the standard algorithm and Elkan's algorithm, and write a method within KMeans to implement the triangle inequality optimization. I will also be moving RoundRobin to the testsuite.

Which of the Elkan algorithm and triangle inequality do you expect to have a bigger impact on the runtime? Because it'd be great to do that one first.

(RoundRobin you should move in its own small PR.)
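
For reference, here is a rough sketch of the kind of pruning the triangle inequality allows in the assignment step. It is plain Python, not Xapian code, and the names (dist, assign_with_pruning) are made up for illustration. It uses only one of the bounds from Elkan's paper: if d(c, c') >= 2 * d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' need not be computed. The full algorithm also carries upper and lower bounds per point between iterations, which this sketch leaves out.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    Skips a point-to-centroid distance whenever the triangle inequality
    shows that centroid cannot be closer than the current best.
    Returns (assignments, number of point-to-centroid distances computed).
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    assignments = []
    computed = 0
    for p in points:
        best = 0
        best_d = dist(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then
            # d(p, c_j) >= d(p, c_best), so c_j cannot win.
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(p, centroids[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
    return assignments, computed
```

Counting the computed distances against the len(points) * k that the standard algorithm needs gives a quick idea of how much a given dataset would benefit before committing to the full implementation.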


 James Aylett

More information about the Xapian-devel mailing list