[Xapian-discuss] Spelling based on frequency and not just distance

Philip Neustrom philipn at gmail.com
Thu Jan 17 03:33:01 GMT 2008


While we're on the subject, another thing to consider is query context.
The code right now suggests on a word-by-word basis, whereas ideally we want
to make suggestions based on other words in the query.
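
As a rough sketch of what I mean by query context (toy data and made-up
function names throughout, nothing here is a Xapian API): given several
candidate corrections for one word, prefer the one that most often shares a
document with the other words in the query.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Count how often each unordered pair of words shares a document."""
    pairs = Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))
        for a, b in combinations(words, 2):
            pairs[(a, b)] += 1
    return pairs


def context_score(candidate, context_words, pairs):
    """Sum co-occurrence counts between a candidate and the other query words."""
    return sum(pairs[tuple(sorted((candidate, w)))] for w in context_words)


def pick_with_context(candidates, context_words, pairs):
    """Choose the candidate that co-occurs most with the rest of the query."""
    return max(candidates, key=lambda c: context_score(c, context_words, pairs))


# Hypothetical three-document corpus, purely for illustration.
docs = [
    "mountain biking trail map",
    "biking trail conditions",
    "bikini beach shop",
]
pairs = build_cooccurrence(docs)

print(pick_with_context(["biking", "bikini"], ["trail"], pairs))  # biking
print(pick_with_context(["biking", "bikini"], ["beach"], pairs))  # bikini
```

On a real index the pair counts would have to come from somewhere cheaper
than a full pass over the documents, and ties simply go to the first
candidate here, so treat this as an illustration of the idea only.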

I haven't used the QueryParser before (I should!), but its
get_corrected_query_string() looks like a good place to make this happen.
What we're doing right now could probably be implemented in Xapian directly
with good results (
http://sycamore.devjavu.com/browser/trunk/Sycamore/search.py?rev=928#L433)
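
To give a rough idea of the kind of frequency check under discussion, here
is a minimal Python sketch.  It is not the Sycamore code linked above and
not Xapian's implementation; term_freqs, max_distance and min_ratio are
names and thresholds invented for illustration, and the distance is plain
Levenshtein.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]


def suggest(word, term_freqs, max_distance=2, min_ratio=10):
    """Suggest a nearby term only if it is much more frequent than `word`.

    An exact match in the index does not block a suggestion; it just
    raises the frequency a candidate has to beat (assumed factor:
    min_ratio).
    """
    own = term_freqs.get(word, 0)
    best, best_freq = None, own * min_ratio
    for term, freq in term_freqs.items():
        if term == word:
            continue
        if freq > best_freq and edit_distance(word, term) <= max_distance:
            best, best_freq = term, freq
    return best


# Hypothetical term frequencies, purely for illustration.
freqs = {"wikipedia": 500, "wikipeda": 3, "plants": 120, "plant": 150, "plan": 200}

print(suggest("wikipeda", freqs))  # wikipedia
print(suggest("plants", freqs))    # None
```

With these made-up numbers the ratio keeps "plants" from being corrected to
"plant", but picking a ratio that behaves well on a real index is exactly
the open question.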

--Philip

On Jan 16, 2008 7:05 PM, Olly Betts <olly at survex.com> wrote:

> On Tue, Jan 15, 2008 at 01:24:33AM -0800, Philip Neustrom wrote:
> > After implementing the new spelling functionality on http://wikispot.org I
> > noticed that terms like "wikipeda" weren't yielding spelling suggestions.
> > Taking a quick look at the code, it looks like if we find an exact match,
> > even if it has a frequency less than another match within the provided
> > delta, we don't suggest anything.  This is probably fine for sites with
> > documents where you can be assured the data is properly spelled -- but not
> > suitable for something like a wiki or the web in general.
>
> I'm not sure I believe there's any non-trivial collection of documents
> without spelling mistakes, but certainly the current spelling correction
> can be problematic.  The current scheme actually does OK when
> misspelling is rampant, since the incorrect spelling will then typically
> return a useful set of results too.
>
> For example, searching Google for "wikipeda" finds "about 64,400"
> results, and a quick look suggests most are relevant.  The sheer size of
> the index helps here too.
>
> It's not just misspellings in documents which are problematic with the
> current scheme.  A genuine word which is also a typo for another word
> is a problem too.  In some cases, both words are common and not a lot can really
> be done anyway (e.g. "biking" vs "bikini").  In others, one word is
> common and the other sufficiently obscure that it's unlikely to be what
> the user meant to search for (e.g. "agent" vs "ahent" - I don't actually
> even know what "ahent" means - I'd guess it's related to "hent" - but
> it's a valid play in Scrabble!)
>
> Some heuristic based on the relative frequencies (and possibly something
> like the average frequency, which I don't think we currently know but
> could track easily enough) seems like a good approach.
>
> Another source of spelling information is logs - what offered spelling
> corrections users have previously accepted is obviously interesting, but
> also just what has been searched for and whether the user performed
> another search within a short time interval may be a source of useful
> information.  It's hard to see how exactly to feed such information
> in though.
>
> > The patch attached to this email is better than the previous.  Hopefully
> > somebody can come up with something better entirely, as I'm not totally
> > happy with what I have -- it tends to suggest things like "plant" for
> > "plants" and then "plan" for "plant" :)
>
> That sounds rather undesirable though.
>
> Probably the best thing to do is open a bug and attach your patch so it
> doesn't just get forgotten.
>
> Cheers,
>    Olly
>

