[Xapian-devel] Document clustering module?
☼ 林永忠 ☼ (Yung-chung Lin)
henearkrxern at gmail.com
Thu Sep 20 23:41:24 BST 2007
On 9/21/07, Olly Betts <olly at survex.com> wrote:
> On Thu, Sep 20, 2007 at 11:12:55PM +0800, ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> > +static bool test_docsim1() {
> > + Xapian::Database db(get_database("etext"));
> > + string query = "Time";
> > + Xapian::Stem stemmer("en");
> > +
> > + Xapian::QueryParser qp;
> > + qp.set_database(db);
> > + qp.set_stemmer(stemmer);
> > +
> > + Xapian::Enquire enq(db);
> > + enq.set_query(qp.parse_query(query));
>
> It doesn't really matter, but we aren't trying to test the QueryParser
> here so I'd just go for the simpler and slightly faster option:
>
> enq.set_query(Xapian::Query(stemmer("time")));
>
I'll fix this part.
> > + Xapian::MSet matches = enq.get_mset(0, 30);
> > +
> > + Xapian::DocSimCosine doc_sim;
> > + doc_sim.set_database(db);
> > + for (Xapian::doccount i = 0; i + 1 < matches.size(); i += 2) {
> > + double sim
> > + = doc_sim.calculate_similarity(matches[i].get_document(),
> > + matches[i+1].get_document());
> > + TEST(sim >= 0 && sim <= 1);
>
> All the documents under consideration contain "time", so we should have
> sim > 0 for DocSimCosine.
>
> Incidentally, what properties do we actually require of similarity
> measures? We should document these so people creating their own
> subclasses know what is required.
>
> I think we definitely require:
>
> * sim(a,b) == sim(b,a) (commutative)
> * sim(a,b) >= 0
>
> We may also want:
>
> * sim(a,b) <= 1
> * sim(a,a) == 1
I forgot to test the properties listed here. I'll add tests for them, and
document them, in the next update.
>
> Are there any others?
I am not sure if there are more.
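
The properties listed above could be checked along the following lines.
This is only a sketch: TermWeights, cosine_sim and
check_similarity_properties are hypothetical stand-ins written against
the standard library, not the DocSim API from the patch. It also assumes
non-negative weights and non-empty documents, since otherwise
sim(a,b) >= 0 and sim(a,a) == 1 need not hold.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical stand-in for a document: term -> non-negative weight.
using TermWeights = std::map<std::string, double>;

// Plain cosine similarity; returns 0 if either vector has zero length.
double cosine_sim(const TermWeights& a, const TermWeights& b) {
    double inner_product = 0, sq_sum_a = 0, sq_sum_b = 0;
    for (const auto& entry : a) {
        sq_sum_a += entry.second * entry.second;
        auto match = b.find(entry.first);
        if (match != b.end()) inner_product += entry.second * match->second;
    }
    for (const auto& entry : b) sq_sum_b += entry.second * entry.second;
    if (sq_sum_a == 0 || sq_sum_b == 0) return 0;
    return inner_product / std::sqrt(sq_sum_a * sq_sum_b);
}

// Asserts the four candidate properties for one pair of documents.
// A small tolerance absorbs floating-point rounding.
void check_similarity_properties(const TermWeights& a, const TermWeights& b) {
    const double eps = 1e-12;
    const double ab = cosine_sim(a, b);
    assert(std::fabs(ab - cosine_sim(b, a)) <= eps);  // sim(a,b) == sim(b,a)
    assert(ab >= 0);                                  // sim(a,b) >= 0
    assert(ab <= 1 + eps);                            // sim(a,b) <= 1
    assert(std::fabs(cosine_sim(a, a) - 1) <= eps);   // sim(a,a) == 1
}
```

In the real test the same four checks would be TEST() calls on
calculate_similarity() for pairs of documents taken from the MSet.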
>
> > + protected:
> > + /// The database from which documents are retrieved.
> > + Database db;
>
> I'm not sure if this should be in the base class or not. It's certainly
> possible to write a subclass which doesn't need it (e.g. count the
> number of terms in common and divide by the greater number of terms)
> although many will need it. It's not much of an overhead though.
>
I think I will keep it since it is handy. And, as you said, it's not
much of an overhead.
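
For reference, the database-free measure described above (terms in common
divided by the greater number of terms) might look roughly like this.
TermSet and overlap_sim are made-up names, and a std::set of strings
stands in for a document's termlist; a real subclass would iterate over
the document's terms instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for a document's termlist: the set of its terms.
using TermSet = std::set<std::string>;

// Number of shared terms divided by the size of the larger termlist.
// Needs no database statistics, only the two documents themselves.
double overlap_sim(const TermSet& a, const TermSet& b) {
    const std::size_t larger = std::max(a.size(), b.size());
    if (larger == 0) return 0.0;  // two empty documents: defined as 0 here
    std::size_t common = 0;
    for (const std::string& term : a) common += b.count(term);
    assert(common <= larger);  // so the result always lies in [0, 1]
    return static_cast<double>(common) / larger;
}
```

This measure is symmetric, stays within [0, 1], and gives 1 for a
non-empty document compared with itself.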
> > +#include <math.h>
>
> We recently decided to prefer `#include <cmath>' in new code (there are
> semantic differences in some instances, so we're holding off a wholesale
> conversion until 1.1.0 to avoid any risk of breaking existing code).
>
I'll convert to cmath.
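
One difference that the C++ standard does guarantee is that <cmath>
declares its functions in namespace std with overloads for float and
long double, whereas C's sqrt() takes and returns double. Whether this is
one of the differences meant above is a guess. A small sketch
(vector_length is a made-up helper, not from the patch):

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// With <cmath>, std::sqrt is overloaded on its argument type.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "float argument selects the float overload");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "double argument selects the double overload");
static_assert(std::is_same<decltype(std::sqrt(1.0L)), long double>::value,
              "long double argument selects the long double overload");

// Hypothetical helper: length of a vector given its sum of squared weights.
double vector_length(double sum_of_squares) {
    assert(sum_of_squares >= 0);
    return std::sqrt(sum_of_squares);
}
```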
> > + for (wt_iter = wt_a.begin(); wt_iter != wt_a.end(); ++wt_iter) {
> > + wt_iter->second /= wt_a_denom;
> > + }
> > +
> > + for (wt_iter = wt_b.begin(); wt_iter != wt_b.end(); ++wt_iter) {
> > + wt_iter->second /= wt_b_denom;
> > + }
>
> Unless I'm missing something we don't need to do this normalisation at
> all, since wt_a_denom and wt_b_denom just factor out of wt_sq_sum_a,
> wt_sq_sum_b, and inner_product and then cancel in the final calculation!
>
> Are you doing it like this because you're worried about overflow or
> something like that?
Yes, I was thinking about overflow, although that probably would not
happen in normal cases. I think I will rewrite this part.
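
To spell out why the normalisation cancels: write a_t and b_t for the raw
weights of term t in the two documents (my notation, not the patch's),
and assume wt_a_denom = \alpha and wt_b_denom = \beta are positive
constants for each document. With a'_t = a_t / \alpha and
b'_t = b_t / \beta:

```latex
\[
\cos(a', b')
  = \frac{\sum_t \frac{a_t}{\alpha}\,\frac{b_t}{\beta}}
         {\sqrt{\sum_t \frac{a_t^2}{\alpha^2}}\;\sqrt{\sum_t \frac{b_t^2}{\beta^2}}}
  = \frac{\frac{1}{\alpha\beta}\sum_t a_t b_t}
         {\frac{1}{\alpha\beta}\sqrt{\sum_t a_t^2}\;\sqrt{\sum_t b_t^2}}
  = \cos(a, b)
\]
```

The numerator is inner_product and the two sums under the square roots
are wt_sq_sum_a and wt_sq_sum_b, so the result is the same whether or not
the weights are divided first.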
>
> If we don't need to normalise here, it seems we don't need wt_a or wt_b
> or wt_a_denom or wt_b_denom at all - we can just run through the
> termlists of the two documents in parallel, advancing whichever
> currently has the lowest term, and calculating the three sums at
> once.
>
> If you don't follow, don't worry - I can try rewriting it before I apply
> it and see if it works.
>
> Cheers,
> Olly
>
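
My understanding of the parallel walk is roughly as follows. This is a
sketch under assumptions: TermList and cosine_merge are hypothetical
names, and a vector of (term, weight) pairs sorted by term with no
duplicates stands in for a Xapian termlist. How the weight of each term
is obtained is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a termlist: (term, weight) sorted by term.
using TermList = std::vector<std::pair<std::string, double>>;

// Walk both lists in parallel, always advancing whichever list currently
// has the lowest term, and accumulate all three sums in a single pass.
double cosine_merge(const TermList& a, const TermList& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    double inner_product = 0, wt_sq_sum_a = 0, wt_sq_sum_b = 0;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        if (j == b.size() || (i < a.size() && a[i].first < b[j].first)) {
            // Term occurs only in a (or b is exhausted).
            wt_sq_sum_a += a[i].second * a[i].second;
            ++i;
        } else if (i == a.size() || b[j].first < a[i].first) {
            // Term occurs only in b (or a is exhausted).
            wt_sq_sum_b += b[j].second * b[j].second;
            ++j;
        } else {
            // Same term in both: it contributes to all three sums.
            inner_product += a[i].second * b[j].second;
            wt_sq_sum_a += a[i].second * a[i].second;
            wt_sq_sum_b += b[j].second * b[j].second;
            ++i;
            ++j;
        }
    }
    if (wt_sq_sum_a == 0 || wt_sq_sum_b == 0) return 0;
    return inner_product / std::sqrt(wt_sq_sum_a * wt_sq_sum_b);
}
```

No intermediate wt_a or wt_b maps are built, and scaling all the weights
of one document by a positive constant leaves the result unchanged up to
rounding.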
Best,
Yung-chung Lin