[Xapian-devel] stemtest failing with romanian
Richard Boulton
richard at lemurconsulting.com
Thu Mar 29 17:52:46 BST 2007
Olly Betts wrote:
> However, I don't think the code generator change is to blame. It looks
> to me like romanian test data just isn't in step with that from snowball
> for some reason:
Hmm - yes - but I generated the romanian.st file from romanian.voc using
the latest snowball, so the algorithms in Xapian should stem the
contents to the same values. Unless I mis-invoked stemwords when
generating the file, of course.
*time passes*
Ah! I see the problem, I think (though I don't know how it could ever
have passed - I must have mis-remembered that...). The word which fails
ends with a capital "I": stemwords lowercases words before passing them
to the stemmer, but stemtest doesn't. If I change this letter to a
lowercase I, the test passes.
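A rough sketch of that mismatch, in Python rather than the C++ of stemtest. Everything here is hypothetical: toy_stem is a made-up stand-in that, like the real algorithms, only recognises lowercase suffixes (it is not the Romanian stemmer), and the words "pomi" and "pomI" are invented for illustration, not taken from the data file.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strips a lowercase 'ii' or 'i' suffix only."""
    for suffix in ("ii", "i"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def stem_like_stemwords(word: str) -> str:
    # stemwords lowercases the word before passing it to the stemmer.
    return toy_stem(word.lower())


def stem_like_stemtest(word: str) -> str:
    # stemtest passes the word through unchanged.
    return toy_stem(word)


def mismatches(vocabulary):
    """Return the words for which the two tools would disagree."""
    return [
        w for w in vocabulary
        if stem_like_stemwords(w) != stem_like_stemtest(w)
    ]


if __name__ == "__main__":
    # Only the entry with the capital letter is reported.
    print(mismatches(["pomi", "pomI"]))
```

With a stemmer that ignores uppercase suffixes, only the entry ending in a capital letter differs between the two paths, which matches what I saw.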
I don't know where the romanian data came from, incidentally; I don't
even know if it actually contains romanian words. Copying the data
files over from snowball also makes the test pass. It doesn't seem to
be a bad idea to have different test data in xapian core than in
snowball, if both are actually romanian... though it might be nice to
append the data file from snowball to the end of the xapian one, to get
better coverage.
For now, I've removed the offending word with the capital letter (I
would just have lowercased it, but the following word is the lowercase
version), and the test now passes.
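For checking a word list for other entries of this kind, something like the following might do. It is only a sketch: the function name is made up, the example words are invented, and Python's str.lower() is assumed to be a close enough stand-in for whatever lowercasing stemwords applies.

```python
def uppercase_entries(words):
    """List (word, lowercase_form_already_present) for each word that
    changes when lowercased."""
    seen = set(words)
    result = []
    for w in words:
        lowered = w.lower()
        if lowered != w:
            # True means lowercasing would just duplicate another entry,
            # so removing the word is the simpler fix.
            result.append((w, lowered in seen))
    return result


if __name__ == "__main__":
    for word, duplicate in uppercase_entries(["pomI", "pomi", "Casa"]):
        action = "remove" if duplicate else "lowercase"
        print(word, action)
```

An entry flagged True is in the same situation as the word I removed: its lowercase form is already in the list.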
--
Richard