[Xapian-tickets] [Xapian] #804: Improve clustering API

Xapian nobody at xapian.org
Sun Mar 19 22:50:12 GMT 2023


#804: Improve clustering API
--------------------------+-------------------------------
 Reporter:  James Aylett  |             Owner:  Olly Betts
     Type:  enhancement   |            Status:  new
 Priority:  normal        |         Milestone:  1.5.0
Component:  Library API   |           Version:  git master
 Severity:  normal        |        Resolution:
 Keywords:                |        Blocked By:
 Blocking:                |  Operating System:  All
--------------------------+-------------------------------

Old description:

> The clustering API we have is a reasonable first draft, but can be
> improved. Here are some initial thoughts, although not all of them will
> necessarily make things better.
>
> 1. Could do with iterators, as one of the common things is going to be to
> iterate over the clusters in the set, then over the documents inside it.
> My code looks very C-like at the moment :)
>
> ''We could use the same pattern as `MSetIterator` and `ESetIterator` with
> an iterator class holding a (ref-counted PIMPL) container object and
> (size-index) - that means end iterator is cheap to construct as it's
> (NULL,0), and a test against the end is just member_var == 0.  We could
> probably make these classes entirely inline.  Related, it would be great
> to support C++11 range-based `for` with all our iterators, which is
> fairly easy
> for the trivial version that just gives you `operator*()` at each
> position, but then you e.g. can't get wdf as well as the term name from a
> `TermIterator` with a range-for loop, and even have to rewrite your loop
> if you later do.''
> 2. The public API doesn't do a huge amount, because much of it can't
> actually be reached from user code. I can make all sorts of things out
> of `FreqSource`, but I can't actually use them for clustering AFAICT. So
> I can't for instance create my own vector space and cluster within that
> — without subclassing `KMeans`.
> 3. Access to original weight in `MSet` — unless this is
> `PointType::get_weight()`, in which case the docstring is misleading
> (says it's TF-IDF). Even then, it'd be nice to access the original order
> within the `MSet` as well. Just looked at the code and it's calculating
> TF-IDF directly to compute term weight and magnitude. That's probably
> okay, but it feels a little odd to me that this happens in the Point
> constructor rather than in the `FreqSource`.
> 4. There's only one similarity measure, but there doesn't seem to be a
> way to set another if I implemented my own.
> 5. I suspect that being able to specify a term prefix and only initialise
> the vector space on that would be helpful. You can do this via a stopper,
> but that's going to be less efficient if clustering lots of docs.
>
> ''It's also more convenient to just specify the prefix.  When the input
> is `Document` termlists (as it currently has to be) then
> `skip_to(prefix)` is just a `next()` loop internally because of how these
> are encoded on disk, so the efficiency gain would mainly be that we could
> stop decoding once we'd got past the prefix of interest.  That's partly
> an implementation detail - we could store data to allow actually skipping
> ahead in a document termlist.''
> 6. I wonder if we should convert to an integer-indexed (but sparse)
> vector space on initialisation. Using terms throughout is almost certainly
> slower? Changing that should mean that Point means a bit more to end
> users who are controlling things themselves, because they can build a
> vector space unrelated to terms.
>
> ''For matching and in backends we stick with terms, but that's mostly
> because otherwise we'd need a way to account for differing numeric
> term ids across databases and that isn't a concern here.  Using strings
> may not be as bad as you might guess since most terms are short and
> different early on, and with integers you'd have to convert, but I think
> it's worth trying with integers.''
> 7. I don't know if this is feasible, but it'd be nice given a Cluster to
> be able to get some stats about it, or at least stats about the Points
> within it. Distance from Centroid, for instance — which I can compute
> directly via the public API, but would be helpful sometimes. (For
> instance if you want to name a cluster, you can either run an algorithm
> to ponder that out of all points, or you can have topics for each
> document as doc data and just use the one closest to the centroid. I
> guess you could just use the first doc in the cluster though.)

New description:

 The clustering API we have is a reasonable first draft, but can be
 improved. Here are some initial thoughts, although not all of them will
 necessarily make things better.

 1. Could do with iterators, as one of the common things is going to be to
 iterate over the clusters in the set, then over the documents inside it.
 My code looks very C-like at the moment :)

 ''We could use the same pattern as `MSetIterator` and `ESetIterator` with
 an iterator class holding a (ref-counted PIMPL) container object and
 (size-index) - that means end iterator is cheap to construct as it's
 (NULL,0), and a test against the end is just member_var == 0.  We could
 probably make these classes entirely inline.  Related, it would be great
 to support C++11 range-based `for` with all our iterators, which is
 fairly easy for
 the trivial version that just gives you `operator*()` at each position,
 but then you e.g. can't get wdf as well as the term name from a
 `TermIterator` with a range-for loop, and even have to rewrite your loop
 if you later do.''
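 A minimal self-contained C++ sketch of the (container, index) pattern
 described above, using hypothetical `ClusterSet`/`ClusterSetIterator`
 names rather than real Xapian classes: the iterator holds a ref-counted
 container plus an offset counted down from the end, so a default-built
 end iterator is the cheap (NULL, 0) pair, the end test only compares
 offsets, and range-based `for` falls out of inline `begin()`/`end()`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a ref-counted PIMPL container of clusters.
struct ClusterSetData {
    std::vector<std::string> clusters;
};

class ClusterSetIterator {
    std::shared_ptr<ClusterSetData> data;  // NULL for the end iterator
    std::size_t off;  // offset from the *end*, so the end test is off == 0

  public:
    // Default constructor gives (NULL, 0): an end iterator, cheap to make.
    ClusterSetIterator() : data(nullptr), off(0) {}
    ClusterSetIterator(std::shared_ptr<ClusterSetData> d, std::size_t off_)
        : data(std::move(d)), off(off_) {}

    const std::string& operator*() const {
        return data->clusters[data->clusters.size() - off];
    }
    ClusterSetIterator& operator++() { --off; return *this; }
    // Comparing against end() only needs to compare the offsets.
    bool operator!=(const ClusterSetIterator& o) const { return off != o.off; }
};

class ClusterSet {
    std::shared_ptr<ClusterSetData> data;

  public:
    explicit ClusterSet(std::vector<std::string> c)
        : data(std::make_shared<ClusterSetData>(ClusterSetData{std::move(c)})) {}
    ClusterSetIterator begin() const {
        return ClusterSetIterator(data, data->clusters.size());
    }
    ClusterSetIterator end() const { return ClusterSetIterator(); }
};
```

 With these inline members, `for (const auto& c : cset)` works directly,
 at the cost of only exposing `operator*()` per position, as noted above.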
 2. The public API doesn't do a huge amount, because much of it can't
 actually be reached from user code. I can make all sorts of things out of
 `FreqSource`, but I can't actually use them for clustering AFAICT. So I
 can't for instance create my own vector space and cluster within that —
 without subclassing `KMeans`.
 3. Access to original weight in `MSet` — unless this is
 `PointType::get_weight()`, in which case the docstring is misleading (says
 it's TF-IDF). Even then, it'd be nice to access the original order within
 the `MSet` as well. Just looked at the code and it's calculating TF-IDF
 directly to compute term weight and magnitude. That's probably okay, but
 it feels a little odd to me that this happens in the Point constructor
 rather than in the `FreqSource`.
 4. There's only one similarity measure, but there doesn't seem to be a way
 to set another if I implemented my own.

 ''Indeed - `CosineDistance` objects are created in xapian-core library
 code and used, which (a) means that `Similarity` and `CosineDistance`
 classes aren't useful in the public API currently and (b) means that these
 methods being virtual is just needless overhead (unless the compiler is
 capable of devirtualising all the calls).''
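 One way item 4 could go, sketched with stand-in types (these names are
 not the current Xapian API): an abstract `Similarity` interface that user
 code can implement and pass to the clusterer, with a cosine measure over
 sparse term-weight vectors as the stock implementation.

```cpp
#include <cmath>
#include <map>
#include <string>

// Sparse term -> weight vector, as a stand-in for a clustering Point.
using SparseVec = std::map<std::string, double>;

// Hypothetical pluggable similarity interface: a clusterer would accept a
// user-supplied Similarity instead of constructing CosineDistance itself.
struct Similarity {
    virtual double similarity(const SparseVec& a, const SparseVec& b) const = 0;
    virtual ~Similarity() {}
};

struct CosineSimilarity : Similarity {
    double similarity(const SparseVec& a, const SparseVec& b) const override {
        double dot = 0, na = 0, nb = 0;
        for (const auto& [t, w] : a) {
            na += w * w;
            auto it = b.find(t);
            if (it != b.end()) dot += w * it->second;
        }
        for (const auto& [t, w] : b) nb += w * w;
        if (na == 0 || nb == 0) return 0;
        return dot / (std::sqrt(na) * std::sqrt(nb));
    }
};
```

 If the library always receives the measure through the base-class
 reference, the virtual call is the price of extensibility; if only the
 stock measure is ever reachable, the virtuals are the needless overhead
 described above.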

 5. I suspect that being able to specify a term prefix and only initialise
 the vector space on that would be helpful. You can do this via a stopper,
 but that's going to be less efficient if clustering lots of docs.

 ''It's also more convenient to just specify the prefix.  When the input is
 `Document` termlists (as it currently has to be) then `skip_to(prefix)` is
 just a `next()` loop internally because of how these are encoded on disk,
 so the efficiency gain would mainly be that we could stop decoding once
 we'd got past the prefix of interest.  That's partly an implementation
 detail - we could store data to allow actually skipping ahead in a
 document termlist.''
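 The efficiency point above can be illustrated with an ordered map as a
 stand-in for a document termlist (not the real on-disk encoding): because
 terms are sorted, a prefix scan can skip to the first candidate term and
 stop decoding as soon as it is past the prefix of interest.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for a document termlist: an ordered term -> wdf map.  Real
// termlists are also sorted by term, which is what makes both the skip
// and the early stop valid.
std::vector<std::string> terms_with_prefix(
        const std::map<std::string, int>& termlist,
        const std::string& prefix) {
    std::vector<std::string> result;
    // Skip ahead to the first term >= prefix (what skip_to() would do).
    for (auto it = termlist.lower_bound(prefix); it != termlist.end(); ++it) {
        // Stop as soon as the current term no longer starts with prefix.
        if (it->first.compare(0, prefix.size(), prefix) != 0) break;
        result.push_back(it->first);
    }
    return result;
}
```

 A `Stopper`-based filter would instead visit every term in every
 document, which is the cost the prefix option avoids.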
 6. I wonder if we should convert to an integer-indexed (but sparse) vector
 space on initialisation. Using terms throughout is almost certainly slower?
 Changing that should mean that Point means a bit more to end users who are
 controlling things themselves, because they can build a vector space
 unrelated to terms.

 ''For matching and in backends we stick with terms, but that's mostly
 because otherwise we'd need a way to account for differing numeric term
 ids across databases and that isn't a concern here.  Using strings may not
 be as bad as you might guess since most terms are short and different
 early on, and with integers you'd have to convert, but I think it's worth
 trying with integers.''
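 A sketch of the interning step item 6 suggests, with hypothetical names:
 each distinct term string is assigned a small integer id once, when the
 vector space is initialised, after which points are sparse int-indexed
 vectors with no remaining tie to the term strings.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical term-id mapper: interns each distinct term string once, so
// all later per-dimension work compares small integers, not strings.
class TermIdMap {
    std::unordered_map<std::string, unsigned> ids;

  public:
    unsigned id_for(const std::string& term) {
        // Assigns ids 0, 1, 2, ... in first-seen order; repeated terms
        // get their existing id back.
        auto r = ids.emplace(term, static_cast<unsigned>(ids.size()));
        return r.first->second;
    }
    std::size_t size() const { return ids.size(); }
};

// A Point then becomes a sparse (term-id -> weight) vector, which a user
// could also build directly from dimensions unrelated to terms.
using SparsePoint = std::unordered_map<unsigned, double>;
```

 The one-off conversion cost is paid at initialisation; whether it wins
 overall is the empirical question raised above.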
 7. I don't know if this is feasible, but it'd be nice given a Cluster to
 be able to get some stats about it, or at least stats about the Points
 within it. Distance from Centroid, for instance — which I can compute
 directly via the public API, but would be helpful sometimes. (For instance
 if you want to name a cluster, you can either run an algorithm to ponder
 that out of all points, or you can have topics for each document as doc
 data and just use the one closest to the centroid. I guess you could just
 use the first doc in the cluster though.)
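 The distance-from-centroid statistic item 7 asks for is cheap to compute
 once points are sparse vectors; a stand-alone sketch with stand-in types
 (not the current API):

```cpp
#include <cmath>
#include <map>
#include <vector>

using SparseVec = std::map<int, double>;

// Centroid of a cluster: the component-wise mean of its points.
SparseVec centroid(const std::vector<SparseVec>& points) {
    SparseVec c;
    for (const auto& p : points)
        for (const auto& [dim, w] : p) c[dim] += w;
    for (auto& [dim, w] : c) w /= points.size();
    return c;
}

// Euclidean distance from a point to the centroid - the kind of per-Point
// statistic a Cluster could expose directly.
double distance(const SparseVec& a, const SparseVec& b) {
    double d2 = 0;
    SparseVec diff = b;
    for (const auto& [dim, w] : a) diff[dim] -= w;
    for (const auto& [dim, w] : diff) d2 += w * w;
    return std::sqrt(d2);
}
```

 With this exposed, "document closest to the centroid" is a single pass
 over the cluster rather than something each application reimplements.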

--
Comment (by Olly Betts):

 Related to this is the `Xapian::Diversify` API - this uses clustering
 internally, and the result is a `Xapian::DocumentSet` object.  Two things
 strike me about this API:

 (a) I'm not convinced the `Diversify` class needs to exist at all - the
 usage pattern currently is:

 {{{
 Xapian::MSet mset = enquire.get_mset(0, 100);
 Xapian::Diversify diversify(k, r, lambda, b, sigma_sqr);
 Xapian::DocumentSet dset = diversify.get_dmset(mset);
 }}}

 We could just provide a new `MSet` method to do this with one fewer
 object/step:

 {{{
 Xapian::MSet mset = enquire.get_mset(0, 100);
 Xapian::DocumentSet dset = mset.diversify(k, r, lambda, b, sigma_sqr);
 }}}

 (b) It seems inconvenient that the result of this isn't an `MSet` object -
 currently to implement an application which optionally diversifies results
 you need two versions of the result display code, one using `MSet` and one
 using `DocumentSet`.

 Problem (b) is perhaps relevant to clustering too, though maybe clustered
 results typically require custom display code anyway.
-- 
Ticket URL: <https://trac.xapian.org/ticket/804#comment:2>
Xapian <https://xapian.org/>


More information about the Xapian-tickets mailing list