[Xapian-devel] Document clustering module?

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Mon Sep 17 02:31:23 BST 2007


> > I just gave it a thought and my simple and non-intrusive idea is to
> > specify clustering algorithm when using Xapian::Enquire and to
> > associate each MSetItem with a cluster id, which would resemble:
> >
> >   Enquire enq;
> >   ClusterSingleLinkage cluster_algorithm;
> >   enq.set_clustering_method(cluster_algorithm);
> >   MSet matches = enq.get_mset(1, 10);
> >   cout << matches.get_cluster_count() << endl;
> >   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
> >       cout << "Document " << *miter << " is in cluster "
> >               << miter->get_cluster_id() << endl;
> >   }
> >
> > And let API users do what they want to do with the clusters.
>
> Yes, that seems a very nice approach.  It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
>
> It wouldn't allow generating different clusterings of the same results
> (without rerunning the search), but that doesn't seem likely to be an
> annoying limitation.

Calling cluster_algorithm.cluster_mset(matches) manually could
re-cluster the matches without rerunning the search, and you could
also pass them to a different clustering algorithm. What about this?
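
A rough sketch of how manual re-clustering might look. Nothing here is
real Xapian API: Match is a stand-in for an MSet entry, the "feature"
field is a toy one-dimensional document vector, and ClusterAlgorithm,
cluster_mset(), ClusterSingleLinkage and ClusterSingleton are all
hypothetical names used only to illustrate the idea.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for an MSet entry (hypothetical, not a Xapian type).
struct Match {
    unsigned docid;
    double feature;   // toy 1-D "document vector", for illustration only
    int cluster_id;   // filled in by cluster_mset()
};

// Hypothetical interface: any algorithm can (re)label an existing result
// list, so the search itself does not need to be rerun.
struct ClusterAlgorithm {
    virtual ~ClusterAlgorithm() {}
    // Sets cluster_id on every match; returns the number of clusters.
    virtual int cluster_mset(std::vector<Match>& matches) const = 0;
};

// Threshold single linkage: two matches end up in the same cluster if a
// chain of pairwise distances, each <= max_dist, connects them.
class ClusterSingleLinkage : public ClusterAlgorithm {
    double max_dist;
public:
    explicit ClusterSingleLinkage(double d) : max_dist(d) {
        assert(max_dist >= 0.0);
    }
    int cluster_mset(std::vector<Match>& matches) const override {
        for (Match& m : matches) m.cluster_id = -1;   // discard old labels
        int count = 0;
        for (std::size_t i = 0; i < matches.size(); ++i) {
            if (matches[i].cluster_id != -1) continue;
            // Flood-fill everything reachable from match i.
            std::vector<std::size_t> pending(1, i);
            matches[i].cluster_id = count;
            while (!pending.empty()) {
                std::size_t cur = pending.back();
                pending.pop_back();
                for (std::size_t j = 0; j < matches.size(); ++j) {
                    if (matches[j].cluster_id != -1) continue;
                    double d = std::fabs(matches[cur].feature -
                                         matches[j].feature);
                    if (d <= max_dist) {
                        matches[j].cluster_id = count;
                        pending.push_back(j);
                    }
                }
            }
            ++count;
        }
        return count;
    }
};

// Trivial alternative, to show that algorithms are interchangeable:
// every match becomes its own cluster.
class ClusterSingleton : public ClusterAlgorithm {
public:
    int cluster_mset(std::vector<Match>& matches) const override {
        int count = 0;
        for (Match& m : matches) m.cluster_id = count++;
        return count;
    }
};
```

With this shape, the same result list can be passed to
ClusterSingleLinkage(1.0), then to ClusterSingleLinkage(0.6) or
ClusterSingleton, and each call simply overwrites the cluster ids.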

Best,
Yung-chung Lin


