[Xapian-devel] Ruby bindings now pass all smoketests

Olly Betts olly at survex.com
Mon Apr 24 01:51:23 BST 2006


On Sat, Apr 22, 2006 at 09:19:15PM -0300, Paul Legato wrote:
> Hi Olly, hi Xapian developers,

There's no need to cc: me on list mails, incidentally.

> There are now 3 complete examples in ruby/examples - simpleexpand.rb, 
> simpleindex.rb, and simplesearch.rb. They appear to be working well. Is 
> there a standard corpus of test documents with expected results for 
> various operations?

No, though there probably should be as it would provide an additional
test of the bindings, and also check that the implementations of the
simple* examples actually match (we've had bugs there before).

I'll try to put a simple test suite together.

> Some of the search terms suggested by simpleexpand are strange, due to 
> the input terms having been stemmed in simpleindex before insertion into 
> the database; it's producing non-words like "sometim" and "merg". Are 
> input terms typically not stemmed in a production environment, or is 
> there some way to get more user-friendly suggestions?

If you're just using it for query expansion, you might just quietly add
the best N terms found to the query so the user will never see them.
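
As a rough sketch of that (plain Python rather than the Xapian API; the
expand_query name and the (term, weight) pair format are just assumptions
for illustration):

```python
def expand_query(query_terms, suggestions, n=3):
    """Return the original terms plus up to n of the best new terms.

    suggestions is a list of (term, weight) pairs, standing in for
    whatever the expand step returned.
    """
    seen = set(query_terms)
    best = []
    ranked = sorted(suggestions, key=lambda tw: tw[1], reverse=True)
    for term, _weight in ranked:
        if len(best) >= n:
            break
        if term not in seen:  # don't re-add terms already in the query
            seen.add(term)
            best.append(term)
    return list(query_terms) + best


suggested = [("sometim", 2.0), ("merg", 5.0), ("branch", 3.5)]
print(expand_query(["merg"], suggested, n=2))
# prints ['merg', 'branch', 'sometim']
```

The stems never reach the user this way, so it doesn't matter that they
aren't real words.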

If they're being presented to the user (like in Omega's $topterms) there
are a few tricks.  Omega currently checks the "unstem" list from the
query parser.  If that finds nothing, it looks to see if the term exists
in the database in unstemmed form (Rfoo) and stems to itself (in which
case it is its own stem, so it can be shown as it is).

Failing that, Omega shows the stem with a trailing "." which suggests
truncation - so "merg." stands for "merge", "merging", etc.
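
Roughly, that decision looks like the sketch below.  This is Python for
illustration, not Omega's actual code: toy_stem is a made-up stand-in for
a real stemmer, unstem is assumed to be a dict from stem to the words the
user typed, and raw_terms stands in for the set of R-prefixed terms in the
database.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strips one common English suffix."""
    for suffix in ("ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def display_form(stem, unstem, raw_terms, stemmer=toy_stem):
    """Pick something presentable to show the user for a stemmed term."""
    # 1. Prefer a word the user actually typed which stemmed to this term.
    if unstem.get(stem):
        return unstem[stem][0]
    # 2. Otherwise, if the term is indexed unstemmed and is its own stem,
    #    it is already a real word.
    if stem in raw_terms and stemmer(stem) == stem:
        return stem
    # 3. Fall back to marking the stem as truncated.
    return stem + "."


print(display_form("merg", {"merg": ["merging"]}, set()))  # prints merging
print(display_form("search", {}, {"search"}))              # prints search
print(display_form("merg", {}, {"merge"}))                 # prints merg.
```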

That doesn't catch all of them by any means, though.  Ultimately, if
you want to avoid ever showing one to the user, you'll have to either
maintain an unstem map (every time you stem a term, make sure it's in
the unstem map) or write an "unstemming" algorithm which produces a list
of terms which can stem to a particular term.  The latter can be done
algorithmically, possibly using an optimistic approach and culling
false positives by testing with the stemmer.
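
A minimal sketch of the unstem map idea, in Python.  The UnstemMap class
and toy_stem are hypothetical names for illustration (toy_stem is not a
real stemmer), and picking the most frequently seen word is just one
possible policy:

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical toy stemmer: strips one common English suffix."""
    for suffix in ("ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class UnstemMap:
    """Wraps a stemmer and remembers which words produced each stem."""

    def __init__(self, stemmer):
        self._stemmer = stemmer
        self._words = defaultdict(Counter)

    def stem(self, word):
        # Every stemming operation goes through here, so the map can
        # never miss a stem which ends up in the database.
        stem = self._stemmer(word)
        self._words[stem][word] += 1
        return stem

    def unstem(self, stem):
        """Return the word most often seen for this stem, or None."""
        counts = self._words.get(stem)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


m = UnstemMap(toy_stem)
for w in ["merging", "merge", "merging", "sometimes"]:
    m.stem(w)
print(m.unstem("merg"))     # prints merging
print(m.unstem("sometim"))  # prints sometimes
```

The cost is that the map has to be stored somewhere alongside the
database and kept up to date at index time.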

It occurs to me that Omega could be a bit smarter here.  For the
English stemmer at least, most of the non-word stems can be converted
to words by adding an "e", or less often an "s".  So we could try those
two tricks if the term alone doesn't exist as a word and stem to itself.

I suspect many of the language stemmers could find a word for most stems
given a few similar hints.
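
To make the "e"/"s" idea concrete, here's an untested-against-Omega sketch
in Python.  word_for_stem, is_word and toy_stem are all hypothetical:
is_word stands in for "exists in the database in unstemmed form", and
toy_stem is a crude stand-in for the English stemmer.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strips one common English suffix."""
    for suffix in ("ing", "es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_for_stem(stem, is_word, stemmer, suffixes=("", "e", "s")):
    """Try the stem alone, then stem + "e", then stem + "s"."""
    for suffix in suffixes:
        candidate = stem + suffix
        # Cull false positives: the candidate must be a known word
        # and must really stem back to the stem we started from.
        if is_word(candidate) and stemmer(candidate) == stem:
            return candidate
    # Nothing suitable found, so fall back to the truncation marker.
    return stem + "."


raw_words = {"merge", "sometime", "search"}
for s in ("merg", "sometim", "search", "branch"):
    print(s, "->", word_for_stem(s, raw_words.__contains__, toy_stem))
```

Other languages would presumably just need a different suffixes list.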

> I haven't yet completely deciphered simplematchdecider.py, nor the 
> wrapper for MatchDecider in Python's util.i.

What simplematchdecider.py does is to run a match but throw away
any potential match for which the document's value 0 is equal to
the string specified by the second command line argument.

I'll add a comment to it which explains this...
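
In the meantime, the logic amounts to something like the sketch below.
It's self-contained Python rather than the real example: FakeDocument is
a made-up stand-in for a Xapian document (it only imitates get_value),
and ValueZeroFilter is a hypothetical name.  With the real bindings the
decider would be handed to the matcher rather than applied to a list.

```python
class FakeDocument:
    """Stand-in for a document: maps value numbers to strings."""

    def __init__(self, title, values):
        self.title = title
        self._values = values

    def get_value(self, valueno):
        # An unset value is treated as the empty string here.
        return self._values.get(valueno, "")


class ValueZeroFilter:
    """Callable which rejects documents whose value 0 equals banned."""

    def __init__(self, banned):
        self.banned = banned

    def __call__(self, doc):
        # True means "keep this potential match".
        return doc.get_value(0) != self.banned


docs = [
    FakeDocument("first", {0: "draft"}),
    FakeDocument("second", {0: "final"}),
    FakeDocument("third", {}),
]
decider = ValueZeroFilter("draft")
print([d.title for d in docs if decider(d)])  # prints ['second', 'third']
```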

Cheers,
    Olly


