[Xapian-tickets] [Xapian] #225: Spelling algorithm should consider frequency and not just edit-distance
Xapian
nobody at xapian.org
Sun Aug 1 11:37:48 BST 2010
#225: Spelling algorithm should consider frequency and not just edit-distance
-------------------------+--------------------------------------------------
Reporter: philipn | Owner: olly
Type: defect | Status: assigned
Priority: high | Milestone: 1.2.x
Component: Library API | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Comment(by olly):
It seems Trac messes up the UTF-8 characters when previewing that rst file
- view the rst file itself for a better version.
Thinking about it, p is going to vary by user, but we're probably talking
about something between 0.001 and 0.01. The value that works best may well
differ from the true probability (since we make various simplifying
assumptions), so it is probably better to tune p empirically for best
results than to try too hard to determine a "true" value.
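
As a rough illustration of why the choice of p matters, here is a minimal
sketch. It assumes a simple noisy-channel style score of freq * p**edits,
which may not match the model in the rst file exactly; pick() and the
frequencies are made up for the example and are not Xapian API.

```python
def pick(candidates, p):
    """Return the word with the highest assumed score freq * p**edits.

    candidates maps word -> (frequency, edit distance from the typed word).
    The typed word itself is included with edit distance 0.
    This scoring form is an assumption, not Xapian's implementation.
    """
    return max(candidates, key=lambda w: candidates[w][0] * p ** candidates[w][1])


# Hypothetical data: "teh" was typed and occurs twice in the index
# (misspelled in the documents); "the" is one edit away and very common.
candidates = {"teh": (2, 0), "the": (10000, 1)}

print(pick(candidates, 0.01))    # a larger p favours the frequent word
print(pick(candidates, 0.0001))  # a very small p keeps the typed word
```

With these numbers the suggestion flips between the two values of p, which
is the sort of behaviour that tuning p against real queries would settle.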
To implement this model efficiently, it would be useful to track an upper
bound on the spelling frequency. That is easy to do, but we don't
currently do it, and adding it looks like it will need an incompatible
database format change.
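
A sketch of how such an upper bound could be used, assuming a score of the
form freq * p**edits (an assumption, not necessarily the exact model) and
candidates grouped by edit distance. suggest(), max_freq and the data
layout are hypothetical, not Xapian API.

```python
def suggest(typed_freq, by_distance, max_freq, p):
    """Return the best correction, or None to keep the typed word.

    typed_freq:  spelling frequency of the typed word (0 if unknown).
    by_distance: dict mapping edit distance d >= 1 to a list of
                 (word, frequency) candidates at that distance.
    max_freq:    upper bound on any word's spelling frequency.
    p:           assumed per-edit error probability, 0 < p < 1.
    """
    best_word, best_score = None, typed_freq  # typed word scores freq * p**0
    for d in sorted(by_distance):
        # No word at distance d or beyond can score above max_freq * p**d,
        # so once that bound cannot beat the current best we can stop.
        if max_freq * p ** d <= best_score:
            break
        for word, freq in by_distance[d]:
            score = freq * p ** d
            if score > best_score:
                best_word, best_score = word, score
    return best_word


# Hypothetical data: distance 2 is never examined in either call.
by_distance = {1: [("the", 10000)], 2: [("tea", 500)]}
print(suggest(2, by_distance, max_freq=10000, p=0.01))
print(suggest(500, by_distance, max_freq=10000, p=0.01))
```

Without the bound, every candidate at every distance would need to be
scored before the best one is known.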
But it's easy to address the specific point about not returning any
correction if the word is in the spelling dictionary (as it may be if
misspelled in the indexed documents) - I've addressed that in trunk
r14859.
--
Ticket URL: <http://trac.xapian.org/ticket/225#comment:10>
Xapian <http://xapian.org/>