[Xapian-devel] Document clustering module?
☼ 林永忠 ☼ (Yung-chung Lin)
henearkrxern at gmail.com
Mon Sep 17 02:31:23 BST 2007
> > I just gave it a thought and my simple and non-intrusive idea is to
> > specify clustering algorithm when using Xapian::Enquire and to
> > associate each MSetItem with a cluster id, which would resemble:
> >
> > Enquire enq;
> > ClusterSingleLinkage cluster_algorithm;
> > enq.set_clustering_method(cluster_algorithm);
> > MSet matches = enq.get_mset(1, 10);
> > cout << matches.get_cluster_count() << endl;
> > for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
> >     cout << "Document " << *miter << " is in cluster "
> >          << miter->get_cluster_id() << endl;
> > }
> >
> > And let API users do what they want to do with the clusters.
>
> Yes, that seems a very nice approach. It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
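If I understand the near-duplicate idea correctly, it could look roughly
like the sketch below. Everything in it is made up for illustration and
is not Xapian API: Candidate, the integer "fingerprint" used as a
similarity measure, near_duplicate() and collect_distinct() are all
assumptions. The point is only that the loop stops examining candidates
as soon as enough distinct ones have been kept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical candidate match: a document id plus a made-up integer
// fingerprint standing in for whatever similarity measure is used.
struct Candidate {
    unsigned docid;
    long fingerprint;
};

// Assumed similarity test: two candidates are near-duplicates when
// their fingerprints differ by at most `tolerance`.
bool near_duplicate(const Candidate& a, const Candidate& b, long tolerance) {
    return std::labs(a.fingerprint - b.fingerprint) <= tolerance;
}

// Walk the candidates in rank order, keep each one that is not a
// near-duplicate of something already kept, and stop as soon as
// `wanted` results have been collected. `examined` reports how many
// candidates were actually looked at.
std::vector<Candidate> collect_distinct(const std::vector<Candidate>& ranked,
                                        std::size_t wanted, long tolerance,
                                        std::size_t& examined) {
    assert(tolerance >= 0);
    std::vector<Candidate> kept;
    examined = 0;
    for (const Candidate& c : ranked) {
        if (kept.size() >= wanted) break;  // enough results: stop early
        ++examined;
        bool dup = false;
        for (const Candidate& k : kept) {
            if (near_duplicate(c, k, tolerance)) {
                dup = true;
                break;
            }
        }
        if (!dup) kept.push_back(c);
    }
    return kept;
}
```

With six ranked candidates where the second and fourth duplicate their
predecessors, asking for three results examines only the first five.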
>
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely to
> be an annoying limitation.
Calling cluster_algorithm.cluster_mset(matches) manually could
re-cluster the matches without rerunning the search, and at that point
you could also choose a different clustering algorithm. What do you
think?
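To make this concrete, here is a rough, self-contained sketch of what I
have in mind. It does not use Xapian at all: Item, Matches,
ClusterAlgorithm and ClusterSingletons are made-up stand-ins, the single
"feature" number per document is an assumption to keep the distance
trivial, and cluster_mset() is only the name proposed above, not an
existing method.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MSet entry: document id, a single
// number used for distance, and the cluster id assigned to it.
struct Item {
    unsigned docid;
    double feature;
    int cluster_id;
};

// Hypothetical stand-in for Xapian::MSet.
struct Matches {
    std::vector<Item> items;
    int cluster_count = 0;
};

// Hypothetical interface: cluster_mset() overwrites any earlier
// cluster assignment, so it can be called again on the same matches.
struct ClusterAlgorithm {
    virtual ~ClusterAlgorithm() {}
    virtual void cluster_mset(Matches& m) const = 0;
};

// Single linkage with a distance cut-off: two items share a cluster
// when a chain of items, each within `threshold` of the next, links them.
struct ClusterSingleLinkage : ClusterAlgorithm {
    double threshold;
    explicit ClusterSingleLinkage(double t) : threshold(t) { assert(t >= 0); }

    void cluster_mset(Matches& m) const override {
        const std::size_t n = m.items.size();
        std::vector<std::size_t> parent(n);
        for (std::size_t i = 0; i < n; ++i) parent[i] = i;
        // Union-find lookup with path halving.
        auto find = [&parent](std::size_t x) -> std::size_t {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];
                x = parent[x];
            }
            return x;
        };
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (std::fabs(m.items[i].feature - m.items[j].feature) <= threshold)
                    parent[find(i)] = find(j);
        // Number the clusters in order of first appearance.
        std::vector<int> label(n, -1);
        int next = 0;
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t root = find(i);
            if (label[root] < 0) label[root] = next++;
            m.items[i].cluster_id = label[root];
        }
        m.cluster_count = next;
    }
};

// A second, trivial method: every match becomes its own cluster.
struct ClusterSingletons : ClusterAlgorithm {
    void cluster_mset(Matches& m) const override {
        for (std::size_t i = 0; i < m.items.size(); ++i)
            m.items[i].cluster_id = static_cast<int>(i);
        m.cluster_count = static_cast<int>(m.items.size());
    }
};
```

So the same Matches object can be passed first to one
ClusterSingleLinkage, then to another with a different threshold, then
to ClusterSingletons, and each call simply replaces the previous cluster
ids without touching the search itself.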
Best,
Yung-chung Lin