[Xapian-devel] Document clustering module?

Sun Sep 16 19:13:14 BST 2007

On Sun, Sep 16, 2007 at 11:52:20PM +0800, Yung-chung Lin wrote:
> And what is your opinion of using Xapian::Weight to calculate document
> similarity?

Xapian::Weight is set up to score a single document by adding scores
from a set of terms (plus an optional contribution which depends only on
the document length), whereas here we want a score from a pair of
documents.  So I think you'd have to convert one of the documents to a
list of all the terms in it, which seems artificial.

And it seems legitimate to allow clustering using document values (e.g.
you might store geographical coordinates in a document value and cluster
by location), which doesn't fit with Xapian::Weight.

So I think a class which provides a similarity measure given two
Xapian::Document objects is probably the answer.
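
As a rough sketch of the shape such a class might take (none of this is
existing Xapian API): DocSimilarity, CosineTermSimilarity and
ValueDistanceSimilarity are hypothetical names, and Doc is a stand-in struct
so that the example compiles on its own.  A real version would presumably
read the terms via Xapian::Document::termlist_begin() and the values via
Xapian::Document::get_value() instead.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Stand-in for Xapian::Document (an assumption, so the sketch is standalone):
// term -> within-document frequency, plus numeric values keyed by slot number.
struct Doc {
    std::map<std::string, unsigned> terms;
    std::map<unsigned, double> values;
};

// Hypothetical interface: a similarity score for a pair of documents,
// where larger means more alike.
class DocSimilarity {
  public:
    virtual ~DocSimilarity() {}
    virtual double similarity(const Doc& a, const Doc& b) const = 0;
};

// Term-based measure: cosine of the angle between term-frequency vectors.
class CosineTermSimilarity : public DocSimilarity {
  public:
    double similarity(const Doc& a, const Doc& b) const override {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (const auto& t : a.terms) {
            norm_a += double(t.second) * t.second;
            auto it = b.terms.find(t.first);
            if (it != b.terms.end()) dot += double(t.second) * it->second;
        }
        for (const auto& t : b.terms) norm_b += double(t.second) * t.second;
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;  // an empty document
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};

// Value-based measure: two slots hold x/y coordinates, and similarity is
// 1 / (1 + distance).  Planar distance only; real latitude/longitude data
// would want a great-circle distance instead.
class ValueDistanceSimilarity : public DocSimilarity {
    unsigned slot_x, slot_y;
  public:
    ValueDistanceSimilarity(unsigned x, unsigned y) : slot_x(x), slot_y(y) {
        assert(slot_x != slot_y);
    }
    double similarity(const Doc& a, const Doc& b) const override {
        auto ax = a.values.find(slot_x), ay = a.values.find(slot_y);
        auto bx = b.values.find(slot_x), by = b.values.find(slot_y);
        if (ax == a.values.end() || ay == a.values.end() ||
            bx == b.values.end() || by == b.values.end())
            return 0.0;  // a document without coordinates matches nothing
        double d = std::hypot(ax->second - bx->second, ay->second - by->second);
        return 1.0 / (1.0 + d);
    }
};
```

A clustering algorithm would then only need a const DocSimilarity&, so the
same code could cluster by terms or by location.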

> > > Now, I am using MultiDSet to store documents. I am wondering whether
> > > it would be better if it returned multiple MSets (a MultiMset), but the
> > > design would be different and more complicated.
> >
> > I think I need to mull over how this would all be used.  Reusing MSet
> > would be nice if it's a good fit, since adding more API classes tends to
> > make it harder to learn the API, so it's good if it can be avoided.  But
> > forcing reuse where something isn't a natural fit would be worse.
> 
> I just gave it a thought and my simple and non-intrusive idea is to
> specify clustering algorithm when using Xapian::Enquire and to
> associate each MSetItem with a cluster id, which would resemble:
> 
>   Enquire enq;
>   ClusterSingleLinkage cluster_algorithm;
>   enq.set_clustering_method(cluster_algorithm);
>   MSet matches = enq.get_mset(1, 10);
>   cout << matches.get_cluster_count() << endl;
>   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
>       cout << "Document " << *miter << " is in cluster "
>               << miter->get_cluster_id() << endl;
>   }
> 
> And let API users do what they want to do with the clusters.

Yes, that seems a very nice approach.  It also more naturally allows the
possibility of using document similarity to eliminate near-duplicates -
to do that efficiently you want to do it as matches are generated so
that you can stop when you have enough in the MSet.
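
To illustrate the early stopping, here is a minimal standalone sketch.  It is
not how the matcher is implemented: take_distinct is a hypothetical helper,
and the candidates are assumed to arrive in rank order from some
already-generated list rather than from the matcher itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walk the candidates in rank order, dropping any whose similarity to an
// already-kept item is at or above the threshold, and stop as soon as
// `wanted` items have been kept so later candidates are never examined.
template <class Item, class Sim>
std::vector<Item> take_distinct(const std::vector<Item>& ranked,
                                std::size_t wanted,
                                double threshold,
                                Sim sim)
{
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::vector<Item> kept;
    for (const Item& candidate : ranked) {
        if (kept.size() >= wanted) break;  // enough results: stop early
        bool near_duplicate = false;
        for (const Item& k : kept) {
            if (sim(candidate, k) >= threshold) {
                near_duplicate = true;
                break;
            }
        }
        if (!near_duplicate) kept.push_back(candidate);
    }
    return kept;
}
```

Each candidate is compared against everything kept so far, so the cost grows
with the size of the result set rather than with the number of matching
documents, which is where the saving would come from.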

It wouldn't allow generating different clusterings of the same results
(without rerunning the search), but that doesn't seem likely to be an
annoying limitation.

Cheers,
    Olly