[Xapian-devel] Adding support for phonetic correction via soundex algorithms
james-xapian at tartarus.org
Wed Dec 31 14:42:35 GMT 2014
Shubham — please keep replies on the mailing list, to more people can help and benefit.
On 31 Dec 2014, at 04:55, Shubham Sharma <shubham8742 at gmail.com> wrote:
>> For the spelling correction side you'll want to consider both implementing a phonetic similarity comparison (along with justification for your choice) and a way of choosing between edit distance and phonetic approaches. You’ll also want to come up with a way of measuring how good each is, or at least for providing people with a way of deciding which approach is more suitable for the system they’re building using Xapian.
> I'm afraid, I beg to differ a bit . The selected or detected language of the query might be taken into account, Don't you think spelling correction via soundex algorithm would be language dependent?
I’m not excluding that as a possibility, and certainly these sorts of considerations should be taken into account when writing your proposal. It’s worth pointing out that Soundex is English-specific, and intended for use on surnames (there are variants for non-English names, and apparently Kölner Phonetik might be suitable for general corpuses in German). Metaphone and double metaphone are generally better than Soundex for general corpuses, although still principally designed for use with American English sources. (There’s also Metaphone 3, which I haven’t used personally and isn’t available freely so would be inappropriate for inclusion in Xapian.)
>> Unless it turns out to be more complex than I’d anticipate, that doesn’t sound like a project that would occupy you for the whole period of GSoC. It could however be combined with something else to make a good application.
> Considering the fact that the number of sounds we can make is limited. The best approach would be to perform a mapping of alphabets of all the 10-12 supported languages to a unified mapping (lets say to english).
If you’re going to map languages (via pronunciation rules?) to a unified system, maybe something based on IPA would be appopriate (since it genuinely is unified, and would support future languages as well). Pronunciation, like stemming, cannot be attacked “perfectly” via an algorithm, but there may be good approaches that combine algorithms with pre-computed exceptions lists (I haven’t looked).
> Followed by implementing the phonetic algorithm (the most popular one being soundex) . Then would be the part of combining it with edit distance. Instead of a weighing scheme i would suggest that first apply soundex (which would improve recall) , then followed by calculating edit distance of queried word to suggested words. (improving precision). Thus giving us the best spell correction results overall.
It sounds to me like you’re proposing implementing Soundex for use by Xapian::Stem (it’s not actually a stemmer, but it would fit at the same place in the pipeline), which while possible would be a fairly small project. It might be interesting to have a number of phonetic systems available like this, and to evaluate their strength in improving recall against the existing language algorithmic stemmers in various situations. (Again, double metaphone is probably more useful than Soundex.) We also might want to be able to chain stemmers (so you get a Xapian::Stem object that will run a word through multiple stemmers in turn), and compare recall against language stemmer + phonetic conversion.
However this doesn’t touch how Xapian currently does spelling correction — is there a reasonable way of using a phonetic algorithm in place of the edit distance? (I’m not convinced that our spelling correction currently improves precision; are you thinking of using edit distance differently?)
Weighting schemes come later than all this, so they’ll still be run (unless you’re proposing a replacement weighting scheme based on spelling correction somehow, but I’m not clear how that would work).
> I guess that's a little bit different than you anticipated so I'd like to know if you now feel it would be good nough for the GSoC period?
Without seeing a concrete proposal it’s impossible for me to really judge; there are things you’re saying that could result in more work, but I’m not clear in all cases whether they’re a suitable approach.
James Aylett, occasional trouble-maker
More information about the Xapian-devel