[Xapian-discuss] Spelling based on frequency and not just distance

Olly Betts olly at survex.com
Thu Jan 17 03:05:42 GMT 2008


On Tue, Jan 15, 2008 at 01:24:33AM -0800, Philip Neustrom wrote:
> After implementing the new spelling functionality on http://wikispot.org I
> noticed that terms like "wikipeda" weren't yielding spelling suggestions.
> Taking a quick look at the code, it looks like if we find an exact match,
> even if it has a frequency less than another match within the provided
> delta, we don't suggest anything.  This is probably fine for sites with
> documents where you can be assured the data is properly spelled -- but not
> suitable for something like a wiki or the web in general.

I'm not sure I believe there's any non-trivial collection of documents
without spelling mistakes, but certainly the current spelling correction
can be problematic.  The current scheme actually does OK when
misspelling is rampant, since the incorrect spelling will then
typically return a useful set of results too.

For example, searching Google for "wikipeda" finds "about 64,400"
results, and a quick look suggests most are relevant.  The sheer size of
the index helps here too.

It's not just misspellings in documents which are problematic with the
current scheme.  A genuine word which is also a typo for another word
is a problem too.  In some cases, both words are common and not a lot
can really be done anyway (e.g. "biking" vs "bikini").  In others, one
word is common and the other sufficiently obscure that it's unlikely to
be what the user meant to search for (e.g. "agent" vs "ahent" - I don't
actually even know what "ahent" means - I'd guess it's related to
"hent" - but it's a valid play in Scrabble!)

Some heuristic based on the relative frequencies (and possibly something
like the average frequency, which I don't think we currently know but
could track easily enough) seems like a good approach.
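
As a rough illustration of what a relative-frequency heuristic might
look like, here is a minimal Python sketch.  It is not Xapian's code:
the names edit_distance, suggest, max_distance and min_ratio are made
up, the frequencies are invented, and plain Levenshtein distance stands
in for whatever distance measure the real spelling code uses.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(word, freqs, max_distance=2, min_ratio=10.0):
    """Return a suggested replacement for word, or None.

    freqs maps each indexed term to its frequency (hypothetical data).
    A term that is itself indexed is only "corrected" if the best
    nearby candidate is at least min_ratio times more frequent.
    """
    own = freqs.get(word, 0)
    best, best_freq = None, 0
    for cand, freq in freqs.items():
        if cand == word or edit_distance(word, cand) > max_distance:
            continue
        if freq > best_freq:
            best, best_freq = cand, freq
    if best is None:
        return None
    if own == 0:
        # Unknown term: any candidate within range is worth offering.
        return best
    return best if best_freq >= min_ratio * own else None


# Invented frequencies, purely for illustration.
freqs = {"wikipedia": 5000, "wikipeda": 3,
         "biking": 400, "bikini": 350,
         "agent": 900, "ahent": 1}

for term in ("wikipeda", "biking", "ahent", "agent"):
    print(term, "->", suggest(term, freqs))
```

With these invented counts, "wikipeda" and "ahent" get a suggestion
while "biking" and "agent" are left alone.  The min_ratio threshold is
an arbitrary choice here; something derived from the average frequency
might well work better.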

Another source of spelling information is logs - what offered spelling
corrections users have previously accepted is obviously interesting, but
also just what has been searched for and whether the user performed
another search within a short time interval may be a source of useful
information.  It's hard to see how exactly to feed such information
in though.
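
For the first kind of log data, one simple starting point might be to
tally how often each offered correction was accepted.  The sketch
below assumes a hypothetical log of (typed, suggested, was_accepted)
tuples; nothing like this exists in Xapian as far as this thread says,
and it doesn't answer how to feed the result back in.

```python
from collections import Counter


def acceptance_rates(log):
    """Map each (typed, suggested) pair to the fraction of times the
    offered suggestion was accepted.  The log format is hypothetical."""
    offered = Counter()
    accepted = Counter()
    for typed, suggested, was_accepted in log:
        offered[(typed, suggested)] += 1
        if was_accepted:
            accepted[(typed, suggested)] += 1
    return {pair: accepted[pair] / count for pair, count in offered.items()}
```

A pair which is offered often but rarely accepted (say "plants" ->
"plant") would be a candidate for suppressing that suggestion.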

> The patch attached to this email is better than the previous.  Hopefully
> somebody can come up with something better entirely, as I'm not totally
> happy with what I have -- it tends to suggest things like "plant" for
> "plants" and then "plan" for "plant" :)

That sounds rather undesirable though.

Probably the best thing to do is open a bug and attach your patch so it
doesn't just get forgotten.

Cheers,
    Olly


