KMeans - Evaluation Results

Richhiey Thomas richhiey.thomas at gmail.com
Wed Aug 17 15:22:23 BST 2016


On Wed, Aug 17, 2016 at 7:23 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> >> How long do 200–300 documents take to cluster? How does it grow as
> more documents are included in the MSet? We'd expect an MSet of 1000
> documents to take longer to cluster than one with 100, but the important
> thing is _how_ the time increases as the number of documents grows.
> >
> > Currently, the time (in seconds) taken to cluster document sets of
> varying sizes is:
> >
> > 100 documents - 0.50 s
> > 200 documents - 1.5 s
> > 300 documents - 4.5 s
> > 400 documents - 6.02 s
> > 500 documents - 10.3 s
> > 600 documents - 17.02 s
> > 700 documents - 23.56 s
> > 800 documents - 29.12 s
> > 900 documents - 36.87 s
> > 1000 documents - 42.46 s
>
> Did you run a profiler over this? I'm wondering where the time is being
> spent. Some notes from the exploratory work done in 2014 over the
> svn/clustering branch may be helpful here:
>
> > By reading the code I noticed that the current code uses a tree map
> (std::map) that introduced a log(N) factor in the complexity of any
> operation involving that structure, so I decided to switch to hash maps
> (std::tr1::unordered_map). This considerably reduced the time and made it
> possible to run the code under the profiler in a decent amount of time. I
> noticed that most of the time (about 94%) was spent in
> Xapian::DocSimCosine::similarity. As I was looking through the code, I
> noticed that the inner product computation was wrong and I reimplemented
> it. While doing that, I also changed the implementation to use a hash map
> instead of a vector. I also noticed that the TF-IDF score was computed for
> every document whenever the similarity was called. I then changed the
> interface for Xapian::DocSimCosine by adding a new method,
> present_document, which would allow the scores to be precomputed.
>
>
Yes, that's right. I've stored the TF-IDF weights of terms in a hash map and
also precomputed the per-document values, since those never change; it's
only the centroid values that change.
As the number of documents increases, the number of terms held in each
centroid grows too, so the rising dimensionality becomes increasingly hard
to deal with.
I did run a profiler, and from its output I could conclude that most of the
time goes into value lookups (which is one of the reasons I was looking at
speeding up lookups). So if the number of dimensions can be reduced,
clustering should get noticeably faster.
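
To make that concrete, the hot path is essentially a cosine similarity
between a document's precomputed TF-IDF weights and the current centroid,
both held in hash maps. This is only an illustrative sketch (the names
TermWeights and cosine_similarity are made up, not the exact ones in my
branch):

#include <cmath>
#include <string>
#include <unordered_map>

typedef std::unordered_map<std::string, double> TermWeights;

double cosine_similarity(const TermWeights &doc, const TermWeights &centroid)
{
    double dot = 0.0, doc_norm = 0.0, centroid_norm = 0.0;

    // O(1) average lookups into the centroid map -- with std::map each of
    // these would cost an extra log(N).
    for (TermWeights::const_iterator it = doc.begin(); it != doc.end(); ++it) {
        doc_norm += it->second * it->second;
        TermWeights::const_iterator c = centroid.find(it->first);
        if (c != centroid.end())
            dot += it->second * c->second;
    }

    for (TermWeights::const_iterator it = centroid.begin();
         it != centroid.end(); ++it)
        centroid_norm += it->second * it->second;

    if (doc_norm == 0.0 || centroid_norm == 0.0)
        return 0.0;
    return dot / (std::sqrt(doc_norm) * std::sqrt(centroid_norm));
}

The per-document weights (and their norms) only need to be computed once,
but the centroid map is rebuilt after every iteration, and the number of
distinct terms it holds grows with the number of documents -- which matches
where the profiler says the time is going.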

You continued:
>
> > You had mentioned that pruning the API, i.e. hiding implementation
> details that are not part of the public API, is an important thing to do.
> So I was looking at how PIMPL has been adopted in Xapian, and if I'm not
> wrong, this has been done with the Internal class. But I hadn't written
> the API in a way that agrees with that design. Any tips or guidelines for
> making the current API conform to PIMPL as implemented in Xapian?
>
> You should be able to take a class at a time in the public API, hiding
> away details that you don't need. If you start with Clusterer, and move
> everything that isn't needed by users of the clustering system out of
> public visibility; then you can move any accessors or methods working on
> private members into an Internal class (you can make one initially with no
> members, and move things bit by bit). Then gradually move through the
> classes that are used by its public API (such as Cluster).
>
> Note that this means you have to complete the refactor to introduce a
> Clusterer class first. (Or maybe 'ClusteringAlgorithm' might be clearer.)
>
> I'm not sure you need DocumentSet as you've implemented things (just
> provide iterators over the documents in the Cluster's API, and put the
> vector<Document> directly in the Cluster). That should make things slightly
> easier.
>
>
I have two questions about this, since I haven't worked with PIMPL before:

1) How do I deal with abstract classes? As I understand it, every class
gets its own Internal class, with the member functions implemented inside
that Internal class. So I wouldn't need to make an Internal class for an
abstract class, right?

2) In the case of classes like Point and Centroid, these aren't part of the
public API but are used internally almost everywhere. Will I need to define
these in the Internal namespace?
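
Just to check my understanding of the pattern, this is roughly the shape I
have in mind for, say, the Clusterer class. All the names here are made up
for illustration, and other Xapian classes seem to keep a reference-counted
intrusive_ptr rather than a raw pointer, but the idea would be:

// Public header: users only ever see this.
class Clusterer {
  public:
    Clusterer();
    ~Clusterer();

    void set_k(unsigned k);
    unsigned get_k() const;

  private:
    class Internal;        // only forward-declared here
    Internal * internal;

    // Copying disabled just to keep the sketch short.
    Clusterer(const Clusterer &);
    void operator=(const Clusterer &);
};

// Internal header / .cc file (not installed): data members and helpers
// live here, so they can change without touching the public API.
class Clusterer::Internal {
  public:
    unsigned k;
    Internal() : k(0) { }
};

Clusterer::Clusterer() : internal(new Internal) { }
Clusterer::~Clusterer() { delete internal; }
void Clusterer::set_k(unsigned k) { internal->k = k; }
unsigned Clusterer::get_k() const { return internal->k; }

If that's right, then Point and Centroid (and whatever else the algorithm
needs) would presumably live entirely behind the internal headers, which
only the library sources include.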

Thanks.

Regards,
Richhiey