[Xapian-discuss] Getting spelling to work

James Aylett james-xapian at tartarus.org
Tue Jan 8 21:22:05 GMT 2008


On Tue, Jan 08, 2008 at 09:15:05PM +0000, James Aylett wrote:

> > Then, most importantly, how does one then populate the spelling
> > dictionary when indexing documents?  Since every time you do
> > add_spelling() the frequency is incremented; what happens if I
> > want to re-index some document (or remove a document)?  For
> > the terms and postings, this is a valid thing to do.  Re-indexing
> > a document as many times as you want doesn't change things.
> > But if you're also adding its terms to the spellings, then re-indexing
> > can seriously skew the frequencies it would seem.
> 
> Umm, no idea. Richard?

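You could iterate over the (non-stemmed) terms in the old version of
your document before replacing it, calling
Xapian::WritableDatabase::remove_spelling() for each one to undo the
earlier add_spelling() calls, and then add the spellings for the new
version as usual.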
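
As a rough sketch of that bookkeeping (not tested against a real
database): SpellingStore below is a hypothetical in-memory stand-in
that only imitates the add_spelling()/remove_spelling() pair, and
reindex() is a made-up helper name. With Xapian itself you would make
the same two passes against your WritableDatabase.

```python
from collections import Counter


class SpellingStore:
    """Hypothetical stand-in for a database's spelling table."""

    def __init__(self):
        self.freq = Counter()

    def add_spelling(self, word, freqinc=1):
        # Each call raises the stored frequency of the word.
        self.freq[word] += freqinc

    def remove_spelling(self, word, freqdec=1):
        # Lower the frequency; forget the word once it reaches zero.
        remaining = self.freq[word] - freqdec
        if remaining > 0:
            self.freq[word] = remaining
        else:
            self.freq.pop(word, None)


def reindex(store, old_words, new_words):
    """Undo the old version's spellings, then add the new version's."""
    for word in old_words:
        store.remove_spelling(word)
    for word in new_words:
        store.add_spelling(word)


store = SpellingStore()
doc_v1 = ["xapian", "spelling", "spelling", "index"]

reindex(store, [], doc_v1)       # first indexing: nothing to undo
reindex(store, doc_v1, doc_v1)   # re-index the same text
unchanged = dict(store.freq)     # same counts as after the first pass

doc_v2 = ["xapian", "spelling", "correction"]
reindex(store, doc_v1, doc_v2)   # replace with an edited version
updated = dict(store.freq)
```

The same helper covers deleting a document: pass the old words and an
empty list for the new ones.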

However, it may not actually be a problem. The frequencies in the
spelling dictionary are there to model the language of your database,
and only relative frequencies matter. So as long as the documents you
reindex have roughly the same word distribution as the rest of your
corpus, you're not going to skew things enough to worry about. Where
you might run into problems is with technical jargon: if certain
uncommon words occur in frequently-updated documents and common words
don't, those words can end up with larger weights than their actual
usage justifies. That's unlikely but not impossible. In that situation
using a fixed dictionary of words for spelling correction may be
better for you anyway.
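
To put some made-up numbers on that: the word counts below are
invented purely for illustration, and relative() is a hypothetical
helper, not anything in Xapian. Re-adding a document whose word mix
matches the corpus leaves the proportions alone, while re-adding one
that only contains a jargon word inflates that word's share.

```python
from collections import Counter


def relative(freq):
    """Turn raw counts into each word's share of the total."""
    total = sum(freq.values())
    return {word: count / total for word, count in freq.items()}


# Invented spelling frequencies for a small corpus.
corpus = Counter({"the": 60, "index": 30, "xapian": 10})

# A document with the same word mix as the corpus, re-added five
# times without removing its old spellings first.
representative = Counter({"the": 6, "index": 3, "xapian": 1})
after_representative = corpus.copy()
for _ in range(5):
    after_representative.update(representative)

# A document containing only a jargon word, re-added five times.
jargon_doc = Counter({"xapian": 10})
after_jargon = corpus.copy()
for _ in range(5):
    after_jargon.update(jargon_doc)

before_share = relative(corpus)["xapian"]
same_mix_share = relative(after_representative)["xapian"]
jargon_share = relative(after_jargon)["xapian"]
```

Here before_share and same_mix_share come out the same, and only
jargon_share moves.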

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


