KMeans - Evaluation Results

James Aylett james-xapian at tartarus.org
Wed Aug 17 14:53:37 BST 2016


On 17 Aug 2016, at 13:40, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

>> How long does 200–300 documents take to cluster? How does it grow as more documents are included in the MSet? We'd expect an MSet of 1000 documents to take longer to cluster than one with 100, but the important thing is _how_ the time increases as the number of documents grows.
> 
> Currently, the time taken to cluster document sets of varying sizes is:
> 
> 100 documents - 0.50 s
> 200 documents - 1.5 s
> 300 documents - 4.5 s
> 400 documents - 6.02 s
> 500 documents - 10.3 s
> 600 documents - 17.02 s
> 700 documents - 23.56 s
> 800 documents - 29.12 s
> 900 documents - 36.87 s
> 1000 documents - 42.46 s

Did you run a profiler over this? I'm wondering where the time is being spent. Some notes from the exploratory work done in 2014 over the svn/clustering branch may be helpful here:

> By reading the code I noticed that the current code uses a tree map (std::map), which introduces a log(N) factor in the complexity of any operation involving that structure, so I decided to switch to hash maps (std::tr1::unordered_map). This considerably reduced the time and made it possible to run the code under the profiler in a reasonable amount of time. I noticed that most of the time (about 94%) was spent in Xapian::DocSimCosine::similarity. As I was looking through the code, I noticed that the inner product computation was wrong and I reimplemented it. While doing that, I also changed the implementation to use a hash map instead of a vector. I also noticed that the TF-IDF score was computed for every document whenever the similarity was called. I then changed the interface of Xapian::DocSimCosine by adding a new method, present_document, which allows the scores to be precomputed.
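The caching pattern described there might look something like the following. This is only an illustrative sketch, not the actual code from that branch; the class shape and the idea of a per-document weight cache keyed by docid are my assumptions, with present_document playing the role named in the notes:

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Sketch: a cosine-similarity helper that caches TF-IDF weights per
// document so that similarity() never recomputes them.  A document's
// weight vector is just a term -> weight hash map; how the weights are
// derived from the index is left out here.
using TermWeights = std::unordered_map<std::string, double>;

class CosineSimilarity {
    std::unordered_map<int, TermWeights> cache_;  // docid -> precomputed weights

  public:
    // Precompute and store the TF-IDF vector for one document
    // (the role played by present_document in the notes above).
    void present_document(int docid, const TermWeights& tfidf) {
        cache_[docid] = tfidf;
    }

    // Cosine similarity over the cached weight vectors.
    double similarity(int a, int b) const {
        const TermWeights& wa = cache_.at(a);
        const TermWeights& wb = cache_.at(b);
        double dot = 0, na = 0, nb = 0;
        for (const auto& t : wa) {
            na += t.second * t.second;
            // Hash lookup: O(1) average, versus O(log N) with std::map.
            auto it = wb.find(t.first);
            if (it != wb.end()) dot += t.second * it->second;
        }
        for (const auto& t : wb) nb += t.second * t.second;
        if (na == 0 || nb == 0) return 0;
        return dot / (std::sqrt(na) * std::sqrt(nb));
    }
};
```

The point is that the expensive part (building the weight vectors) happens once per document, and the inner loop of clustering only does hash lookups and multiplications.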

You continued:

> You had mentioned that pruning the API, to hide implementation details that are not part of the public interface, is an important thing to do. So I was looking at how PIMPL has been adopted in Xapian, and if I'm not wrong, this has been done with the Internal class. But I hadn't written the API in a way that agrees with that design. Any tips or guidelines on making the current API conform to PIMPL as implemented in Xapian?

You should be able to take the public API one class at a time, hiding away details that users don't need. Start with Clusterer: move everything that isn't needed by users of the clustering system out of public visibility, then move any accessors or methods working on private members into an Internal class (you can create one with no members initially, and move things across bit by bit). Then gradually work through the classes used by its public API (such as Cluster).

Note that this means you have to complete the refactor to introduce a Clusterer class first. (Or maybe 'ClusteringAlgorithm' might be clearer.)
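Structurally, the shape you're aiming for is roughly this (a minimal sketch only; member and method names here are made up, and Xapian's real Internal classes are reference counted rather than held by a smart pointer, but the layering is the same idea):

```cpp
#include <memory>

// Public header: only the API surface is visible.  All private state
// lives behind an opaque pointer to an Internal class.
class Clusterer {
  public:
    class Internal;           // defined in the .cc file, invisible to API users
    Clusterer();
    ~Clusterer();
    void set_k(unsigned k);   // public methods forward to the Internal
    unsigned get_k() const;

  private:
    std::unique_ptr<Internal> internal;
};

// --- normally in clusterer.cc ---
class Clusterer::Internal {
  public:
    unsigned k = 0;           // private state lives here, not in the header
};

Clusterer::Clusterer() : internal(new Internal) {}
Clusterer::~Clusterer() = default;   // defined where Internal is complete
void Clusterer::set_k(unsigned k) { internal->k = k; }
unsigned Clusterer::get_k() const { return internal->k; }
```

Because the header never mentions Internal's members, you can reorganise the implementation freely without breaking ABI or recompiling users of the public header.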

I'm not sure you need DocumentSet as you've implemented things (just provide iterators over the documents in the Cluster's API, and put the vector<Document> directly in the Cluster). That should make things slightly easier.
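Something like this is what I have in mind (a sketch under the assumption that Cluster owns its documents; Document here is a stand-in for Xapian::Document):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for Xapian::Document.
struct Document { std::string data; };

// Sketch: a Cluster that stores its documents directly and exposes
// iterators over them, removing the need for a separate DocumentSet.
class Cluster {
    std::vector<Document> docs_;

  public:
    void add_document(const Document& d) { docs_.push_back(d); }
    std::size_t size() const { return docs_.size(); }

    // Iteration goes straight over the internal vector.
    std::vector<Document>::const_iterator begin() const { return docs_.begin(); }
    std::vector<Document>::const_iterator end() const { return docs_.end(); }
};
```

Callers then just write `for (const auto& doc : cluster) { ... }` with no intermediate container type to maintain.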

J

-- 
 James Aylett, +44 7976 212023
 Principal Consultant, Kittens with Jetpacks Ltd

 Registered company no 6587395
 19 Cottrell Court, Southern Way, London SE10 0DW
