[Xapian-tickets] [Xapian] #804: Improve clustering API

Mon Aug 31 22:52:13 BST 2020

#804: Improve clustering API
---------------------------------+------------------------
        Reporter:  James Aylett  |      Owner:  Olly Betts
            Type:  enhancement   |     Status:  new
        Priority:  normal        |  Milestone:  1.5.0
       Component:  Library API   |    Version:  git master
        Severity:  normal        |   Keywords:
      Blocked By:                |   Blocking:
Operating System:  All           |
---------------------------------+------------------------
 The clustering API we have is a reasonable first draft, but can be
 improved. Here are some initial thoughts, although not all of them
 will necessarily make things better.

 1. Could do with iterators, as one of the common things is going to be to
 iterate over the clusters in the set, then over the documents inside each
 one. My code looks very C-like at the moment :)
 2. The public API doesn't do a huge amount, because most of it is
 unreachable: I can make all sorts of things out of !FreqSource, but I
 can't actually use them for clustering AFAICT. So I can't, for instance,
 create my own vector space and cluster within that without subclassing
 !KMeans.
 3. Access to the original weight in the !MSet, unless this is
 !PointType::get_weight(), in which case the docstring is misleading (it
 says it's TF-IDF). Even then, it'd be nice to access the original order
 within the !MSet as well. Having just looked at the code, it's calculating
 TF-IDF directly to compute term weight and magnitude. That's probably
 okay, but it feels a little odd to me that this happens in the Point
 constructor rather than in the !FreqSource.
 4. There's only one similarity measure, and there doesn't seem to be a way
 to set another if I implemented my own.
 5. I suspect that being able to specify a term prefix and only initialise
 the vector space on that would be helpful. You can do this via a stopper,
 but that's going to be less efficient if clustering lots of docs.
 6. I wonder if we should convert to an integer-indexed (but sparse) vector
 space on initialisation. Using terms throughout is almost certainly slower?
 Changing that should mean that Point means a bit more to end users who are
 controlling things themselves, because they can build a vector space
 unrelated to terms.
 7. I don't know if this is feasible, but it'd be nice given a Cluster to
 be able to get some stats about it, or at least stats about the Points
 within it. Distance from the Centroid, for instance, which I can compute
 directly via the public API, but which would be helpful to have to hand
 sometimes. (For instance, if you want to name a cluster, you can either
 run an algorithm to work that out from all the points, or you can have
 topics for each document as doc data and just use the one closest to the
 centroid. I guess you could just use the first doc in the cluster though.)
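
 To make points 4, 6 and 7 a bit more concrete, here is a rough
 standalone sketch of what I have in mind. It is not Xapian code: the names
 TermIndex, SparseVec, Similarity, centroid() and closest_to_centroid() are
 all made up for illustration, and the real classes would no doubt look
 different. It maps terms to integers once up front, takes the similarity
 measure as a parameter, and picks the point nearest the centroid.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; none of these exist in Xapian.

// Sparse vector keyed by an integer dimension rather than a term string.
using SparseVec = std::map<unsigned, double>;

// Assigns each distinct term a small integer, once, at initialisation.
class TermIndex {
    std::unordered_map<std::string, unsigned> ids;
  public:
    unsigned id_for(const std::string& term) {
        unsigned next = static_cast<unsigned>(ids.size());
        // emplace() leaves an existing entry untouched, so ids stay stable.
        return ids.emplace(term, next).first->second;
    }
    std::size_t size() const { return ids.size(); }
};

// Pluggable similarity measure: any callable taking two points.
using Similarity = std::function<double(const SparseVec&, const SparseVec&)>;

double dot(const SparseVec& a, const SparseVec& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        if (it != b.end()) sum += entry.second * it->second;
    }
    return sum;
}

double cosine_similarity(const SparseVec& a, const SparseVec& b) {
    double denom = std::sqrt(dot(a, a)) * std::sqrt(dot(b, b));
    return denom == 0.0 ? 0.0 : dot(a, b) / denom;
}

// Component-wise mean of the points in one cluster.
SparseVec centroid(const std::vector<SparseVec>& points) {
    SparseVec c;
    for (const auto& p : points)
        for (const auto& entry : p) c[entry.first] += entry.second;
    for (auto& entry : c) entry.second /= static_cast<double>(points.size());
    return c;
}

// Index of the point most similar to the cluster's centroid.
std::size_t closest_to_centroid(const std::vector<SparseVec>& points,
                                const Similarity& sim) {
    assert(!points.empty());
    const SparseVec c = centroid(points);
    std::size_t best = 0;
    double best_sim = sim(points[0], c);
    for (std::size_t i = 1; i < points.size(); ++i) {
        double s = sim(points[i], c);
        if (s > best_sim) { best_sim = s; best = i; }
    }
    return best;
}
```

 With something along these lines, closest_to_centroid(points,
 cosine_similarity) would give the document to name the cluster after, and
 passing a different callable (a lambda, say) swaps the similarity measure
 without subclassing anything.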
-- 
Ticket URL: <https://trac.xapian.org/ticket/804>
Xapian <https://xapian.org/>