[Xapian-devel] stemtest failing with romanian

Olly Betts olly at survex.com
Thu Mar 29 18:12:24 BST 2007


On Thu, Mar 29, 2007 at 05:52:46PM +0100, Richard Boulton wrote:
> Ah! I see the problem, I think (though I don't know how it could ever 
> have passed - I must have mis-remembered that...). The word which fails 
> ends with a capital "I": stemwords lowercases words before passing them 
> to the stemmer, but stemtest doesn't.  If I change this letter to a 
> lowercase I, the test passes.

Ah yes, I had fixed "stemtest" to lowercase words (for about 3 words in
the Hungarian test vocab), but I reverted that change recently since
Martin agreed it was reasonable for the test vocabularies to be in lower
case.  But my xapian-data wasn't up-to-date then so I didn't spot the
test regression this caused.  Sorry.
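
To illustrate the mismatch, here's a minimal Python sketch. It isn't the
real stemtest or stemwords code: toy_stem is a hypothetical stand-in for a
stemmer which assumes lowercase input, and the sample word is made up.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: it only strips a trailing
    # lowercase "i", so it assumes its input is already lowercase.
    if len(word) > 2 and word.endswith("i"):
        return word[:-1]
    return word

def check_with_lowercasing(word, expected):
    # Lowercase before stemming, as stemwords is described as doing.
    return toy_stem(word.lower()) == expected

def check_without_lowercasing(word, expected):
    # Pass the word through untouched, as stemtest does since the revert.
    return toy_stem(word) == expected

# Made-up vocabulary entry ending in a capital "I", with its expected stem.
word, expected = "abcI", "abc"
print(check_with_lowercasing(word, expected))     # True
print(check_without_lowercasing(word, expected))  # False
```

The same vocabulary entry passes one check and fails the other, which is
the sort of disagreement we saw between stemwords and stemtest.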

> I don't know where the romanian data came from, incidentally; I don't 
> even know if it actually contains romanian words.

It's snowball's old "data/romanian1/voc.txt".  Looks like Martin pruned
the list or substituted a shorter one when he did his version of the
romanian stemmer.

> Copying the data files over from snowball also makes the test pass.
> It doesn't seem to be a bad idea to have different test data in xapian
> core than in snowball, if both are actually romanian... though it
> might be nice to append the data file from snowball to the end of the
> xapian one, to get better coverage.

I think we should definitely include the words from snowball.  Testing
more is fine, provided we don't go overboard and render "make check"
unusably slow!  We have some code generation changes relative to
snowball still, and it's good to have confidence that we give the same
results as snowball on their vocabulary lists, and also on other inputs.

Perhaps it would be good to have "snowball" and "extra" lists separate
- e.g. stemdict1 could test the snowball lists, and a new stemdict2 the
extra lists.
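
Roughly what I have in mind, as a Python sketch rather than actual test
suite code. The run_vocab helper, toy_stem and the sample words are all
made up for illustration; real lists would be read from the data files.

```python
def run_vocab(stem, words, expected_stems):
    # Return (word, got, wanted) for every word whose stem differs.
    failures = []
    for word, wanted in zip(words, expected_stems):
        got = stem(word)
        if got != wanted:
            failures.append((word, got, wanted))
    return failures

def toy_stem(word):
    # Hypothetical stand-in for the stemmer under test.
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

# Made-up sample data standing in for the two kinds of list.
SNOWBALL = (["cats", "dog"], ["cat", "dog"])
EXTRA = (["houses", "bus"], ["house", "bus"])

def stemdict1():
    # Words taken from snowball's vocabulary lists.
    return run_vocab(toy_stem, *SNOWBALL)

def stemdict2():
    # Proposed new test: words which aren't in snowball's lists.
    return run_vocab(toy_stem, *EXTRA)

for test in (stemdict1, stemdict2):
    print(test.__name__, "failures:", test())
```

Keeping the two apart means a failure immediately tells us whether we
disagree with snowball on their own vocabulary or only on the extra words.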

> For now, I've removed the offending word with the capital letter (I 
> would just have lowercased it, but the following word is the lowercase 
> version), and the test now passes.

Sounds good.
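
If we want to check the other vocab lists for the same problem, something
like this Python sketch could flag the entries. The audit_case name and the
sample words are hypothetical; it just separates entries which can be
removed (the lowercase form is already listed) from ones which would need
lowercasing in place.

```python
def audit_case(words):
    # Classify entries which aren't entirely lowercase.
    present = set(words)
    removable, to_lowercase = [], []
    for word in words:
        lowered = word.lower()
        if word == lowered:
            continue  # Already lowercase, nothing to do.
        if lowered in present:
            removable.append(word)      # Lowercase version is in the list.
        else:
            to_lowercase.append(word)   # No lowercase version present.
    return removable, to_lowercase

# Made-up sample: "abcI" has a lowercase twin in the list, "Xyz" doesn't.
removable, to_lowercase = audit_case(["abcI", "abci", "Xyz"])
print(removable)     # ['abcI']
print(to_lowercase)  # ['Xyz']
```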

Cheers,
    Olly


