[Xapian-devel] Document clustering module?
☼ 林永忠 ☼ (Yung-chung Lin)
henearkrxern at gmail.com
Sun Sep 16 16:52:20 BST 2007
> > Maybe putting the similarity function into a class would be even
> > better. It needs discussion.
>
> I think that is probably the answer.
And what is your opinion of using Xapian::Weight to calculate document
similarity?
I have not read through the code yet, but it seems heavyweight for
this use.
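
To make the "lighter" alternative concrete, here is a minimal sketch of a
similarity function wrapped in a class, as discussed above. It does not use
Xapian::Weight or any other Xapian class. The names TermVector, DocSimilarity
and CosineSimilarity are hypothetical, and cosine similarity over raw term
frequencies is only one possible measure, chosen here for illustration.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical lightweight document representation:
// term -> within-document frequency.
typedef std::map<std::string, double> TermVector;

// Hypothetical interface, so that other measures can be swapped in.
class DocSimilarity {
  public:
    virtual ~DocSimilarity() {}
    virtual double similarity(const TermVector &a,
                              const TermVector &b) const = 0;
};

// Cosine of the angle between two term-frequency vectors.
class CosineSimilarity : public DocSimilarity {
  public:
    double similarity(const TermVector &a, const TermVector &b) const {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (TermVector::const_iterator i = a.begin(); i != a.end(); ++i) {
            norm_a += i->second * i->second;
            // Only terms present in both documents contribute to the dot product.
            TermVector::const_iterator j = b.find(i->first);
            if (j != b.end()) dot += i->second * j->second;
        }
        for (TermVector::const_iterator j = b.begin(); j != b.end(); ++j)
            norm_b += j->second * j->second;
        // An empty document is treated as similar to nothing.
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

A clustering algorithm would then only need a reference to a DocSimilarity,
so the measure could be changed without touching the clustering code.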
>
> > Now, I am using MultiDSet to store documents. I am wondering whether it
> > would be better if it returned multiple MSets (a MultiMset), but the design
> > would be different and more complicated.
>
> I think I need to mull over how this would all be used. Reusing MSet
> would be nice if it's a good fit, since adding more API classes tends to
> make it harder to learn the API, so it's good if it can be avoided. But
> forcing reuse where something isn't a natural fit would be worse.
>
I just gave it some thought, and my simple and non-intrusive idea is to
specify the clustering algorithm when using Xapian::Enquire and to
associate each MSetItem with a cluster id. It would look something like this:

    // Proposed API sketch: ClusterSingleLinkage, set_clustering_method(),
    // get_cluster_count() and get_cluster_id() are suggestions, not existing API.
    Enquire enq;
    ClusterSingleLinkage cluster_algorithm;
    enq.set_clustering_method(cluster_algorithm);

    MSet matches = enq.get_mset(1, 10);
    cout << matches.get_cluster_count() << endl;

    for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
        cout << "Document " << *miter << " is in cluster "
             << miter->get_cluster_id() << endl;
    }

And let API users do what they want to do with the clusters.
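
For illustration, the cluster ids themselves could be computed along the
following lines. This is only a rough sketch under stated assumptions, not an
implementation of the proposed classes: it takes a precomputed pairwise
similarity matrix for the ranked results and a cut-off threshold, and treats
single linkage as "two results share a cluster if a chain of pairs above the
threshold connects them". The function names and the threshold parameter are
hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Find the representative of x's cluster (union-find with path halving).
static std::size_t find_root(std::vector<std::size_t> &parent, std::size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Hypothetical helper: sim[i][j] is the similarity between results i and j
// (only entries with j > i are read). Returns one cluster id per result,
// numbered from 0 in order of first appearance in the ranking.
std::vector<std::size_t>
single_linkage_ids(const std::vector<std::vector<double> > &sim,
                   double threshold) {
    const std::size_t n = sim.size();
    std::vector<std::size_t> parent(n);
    for (std::size_t i = 0; i < n; ++i) parent[i] = i;

    // Merge the clusters of every pair that is similar enough.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if (sim[i][j] >= threshold) {
                std::size_t ri = find_root(parent, i);
                std::size_t rj = find_root(parent, j);
                if (ri != rj) parent[rj] = ri;
            }
        }
    }

    // Renumber cluster representatives to dense ids; n means "not yet seen".
    std::vector<std::size_t> ids(n), label(n, n);
    std::size_t next_id = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t r = find_root(parent, i);
        if (label[r] == n) label[r] = next_id++;
        ids[i] = label[r];
    }
    return ids;
}

// Number of distinct clusters in a dense id vector.
std::size_t cluster_count(const std::vector<std::size_t> &ids) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < ids.size(); ++i)
        if (ids[i] + 1 > count) count = ids[i] + 1;
    return count;
}
```

The resulting vector has one entry per result, in rank order, so it could be
carried alongside the MSet items without changing how they are iterated.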
Best,
Yung-chung Lin