[Xapian-devel] stemtest failing with romanian

Richard Boulton richard at lemurconsulting.com
Thu Mar 29 17:52:46 BST 2007


Olly Betts wrote:
> However, I don't think the code generator change is to blame.  It looks
> to me like romanian test data just isn't in step with that from snowball
> for some reason:

Hmm - yes - but I generated the romanian.st file from romanian.voc using 
the latest snowball, so the algorithms in Xapian should stem the 
contents to the same values.  Unless I mis-invoked stemwords when 
generating the file, of course.

*time passes*

Ah! I see the problem, I think (though I don't know how it could ever 
have passed - I must have mis-remembered that...). The word which fails 
ends with a capital "I": stemwords lowercases words before passing them 
to the stemmer, but stemtest doesn't.  If I change this letter to a 
lowercase I, the test passes.
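
To illustrate the mismatch, here is a minimal sketch in Python. The toy_stem function is a made-up stand-in, not the Snowball Romanian algorithm; it only strips a trailing lowercase "i". The two wrapper names are also hypothetical and just mimic the behaviour described above: one lowercases before stemming, the other passes the word through unchanged.

```python
# Toy stand-in for a stemmer: NOT the Snowball Romanian algorithm.
# It only strips a trailing lowercase "i", which is enough to show
# why the case of the input matters.
def toy_stem(word: str) -> str:
    return word[:-1] if word.endswith("i") else word


def stemwords_style(word: str) -> str:
    # Lowercase first, as stemwords is described as doing.
    return toy_stem(word.lower())


def stemtest_style(word: str) -> str:
    # Pass the word through unchanged, as stemtest is described as doing.
    return toy_stem(word)


word = "abcI"  # made-up word ending in a capital "I"
print(stemwords_style(word))  # abc
print(stemtest_style(word))   # abcI
```

With an all-lowercase input both paths agree, which would explain why only the one word with a capital letter fails.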

I don't know where the romanian data came from, incidentally; I don't 
even know if it actually contains romanian words.  Copying the data 
files over from snowball also makes the test pass.  It doesn't seem to 
be a bad idea to have different test data in xapian core than in 
snowball, if both are actually romanian... though it might be nice to 
append the data file from snowball to the end of the xapian one, to get 
better coverage.

For now, I've removed the offending word with the capital letter (I 
would just have lowercased it, but the following word is the lowercase 
version), and the test now passes.
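
A quick way to check a vocabulary list for other such words might look like the sketch below. It assumes one word per line; the function name and the sample data are made up for illustration, and nothing here is part of Xapian or snowball.

```python
def words_with_uppercase(lines):
    """Return (line_number, word) pairs for words that change when lowercased."""
    hits = []
    for number, line in enumerate(lines, start=1):
        word = line.strip()
        # A word is "offending" here if lowercasing would alter it.
        if word and word != word.lower():
            hits.append((number, word))
    return hits


sample = ["abc", "abcI", "abci"]  # made-up data, not real test words
print(words_with_uppercase(sample))  # [(2, 'abcI')]
```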

-- 
Richard


