[Xapian-devel] Document clustering module?

Sun Sep 16 16:52:20 BST 2007

> > Maybe putting the similarity function into a class would be even
> > better. It needs discussion.
>
> I think that is probably the answer.

And what is your opinion of using Xapian::Weight to calculate document
similarity?
I have not read through the code yet, but it seems heavyweight for this
use.
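
As a point of comparison, here is a minimal sketch of what a lightweight
similarity class could look like. It assumes each document has already been
reduced to a term-to-weight map. TermVector, DocSim and DocSimCosine are
hypothetical names for illustration, not existing Xapian classes, and cosine
similarity is only one possible measure.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical types for illustration; none of these are part of Xapian.
typedef std::map<std::string, double> TermVector;  // term -> weight

// Abstract similarity measure, so clustering code need not know the formula.
class DocSim {
public:
    virtual ~DocSim() {}
    // Returns a similarity score; higher means more alike.
    virtual double similarity(const TermVector &a,
                              const TermVector &b) const = 0;
};

// Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
class DocSimCosine : public DocSim {
public:
    double similarity(const TermVector &a, const TermVector &b) const {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (TermVector::const_iterator i = a.begin(); i != a.end(); ++i) {
            norm_a += i->second * i->second;
            // Only terms present in both documents contribute to the dot
            // product.
            TermVector::const_iterator j = b.find(i->first);
            if (j != b.end()) dot += i->second * j->second;
        }
        for (TermVector::const_iterator j = b.begin(); j != b.end(); ++j)
            norm_b += j->second * j->second;
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

A clustering algorithm would then take a `const DocSim &` and never need to
know which measure is in use.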

>
> > Now, I am using MultiDSet to store documents. I am thinking if it
> > would better if it returns multiple MSets, MultiMset, but the design
> > will be different and more complicated.
>
> I think I need to mull over how this would all be used.  Reusing MSet
> would be nice if it's a good fit, since adding more API classes tends to
> make it harder to learn the API, so it's good if it can be avoided.  But
> forcing reuse where something isn't a natural fit would be worse.
>

I just gave it some thought, and my simple and non-intrusive idea is to
specify a clustering algorithm when using Xapian::Enquire and to
associate each MSetItem with a cluster id. It would look something like:

  Enquire enq(db);  // db: an already opened Xapian::Database
  ClusterSingleLinkage cluster_algorithm;        // proposed class
  enq.set_clustering_method(cluster_algorithm);  // proposed method
  MSet matches = enq.get_mset(1, 10);
  cout << matches.get_cluster_count() << endl;   // proposed method
  for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
      cout << "Document " << *miter << " is in cluster "
           << miter->get_cluster_id() << endl;   // proposed method
  }

API users can then do whatever they want with the clusters.
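
For instance, a caller could group the matches by cluster id for display.
The sketch below does not use Xapian at all: MatchItem is a hypothetical
stand-in for a (document id, cluster id) pair as it might be read from the
proposed API, and group_by_cluster is an illustrative helper name.

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for values read from the proposed API.
typedef unsigned docid;
typedef unsigned clusterid;
typedef std::pair<docid, clusterid> MatchItem;  // (document id, cluster id)

// Group document ids by cluster id, keeping the original rank order
// within each cluster.
std::map<clusterid, std::vector<docid> >
group_by_cluster(const std::vector<MatchItem> &matches) {
    std::map<clusterid, std::vector<docid> > clusters;
    for (std::vector<MatchItem>::const_iterator i = matches.begin();
         i != matches.end(); ++i) {
        clusters[i->second].push_back(i->first);
    }
    return clusters;
}
```

Because the cluster id travels with each match, this kind of grouping stays
in user code and the MSet itself keeps its usual ranked order.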

Best,
Yung-chung Lin