[Xapian-discuss] using xapian for indexing mails [SOLVED]
Rusty Conover
rconover at infogears.com
Mon Sep 1 10:57:29 BST 2008
>
> I noticed that the stemming is language-specific (understandably); is
> there some recommended way to guess the language of a blob of text? For
> me, speed is more important than 100% accuracy (which would be hard
> anyway, and consider multi-language text etc...)
N-gram analysis works pretty well.
In a nutshell it works like this:
Step 1. Training: Produce n-grams from sample texts in various
languages, and keep the N most frequent n-grams for each language, where
N is sufficiently large.
Step 2. Analysis: Count how many n-grams from the text of unknown
language match the stored n-grams for each language. The language with
the most matches is probably the language of that text.
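The two steps could be sketched in Python roughly as follows. This is only a minimal illustration, not the code from either link below: it assumes character trigrams, and the names ngrams, train and guess_language, the padding with spaces, and the default of 300 n-grams per language are all made up for the example.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return the overlapping character n-grams of the text.

    The text is lowercased, runs of whitespace are collapsed, and one
    space is added at each end so word boundaries show up in the n-grams.
    """
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3, top=300):
    """Step 1: keep the `top` most frequent n-grams for each language.

    `samples` maps a language label to a sample text in that language.
    """
    profiles = {}
    for lang, text in samples.items():
        counts = Counter(ngrams(text, n))
        profiles[lang] = {gram for gram, _ in counts.most_common(top)}
    return profiles


def guess_language(text, profiles, n=3):
    """Step 2: count matching n-grams per language and pick the best.

    Returns (best_language, scores) so the caller can see how close
    the runner-up was.
    """
    grams = ngrams(text, n)
    scores = {
        lang: sum(1 for gram in grams if gram in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores
```

Call train() once with a dict such as {"en": english_text, "de": german_text} and reuse the result for every mail. Real training texts need to be much longer than a sentence or two for the guesses to be reliable, and very short inputs will always be shaky.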
See:
http://www.rubyinside.com/whatlanguage-ruby-language-detection-library-1085.html
http://code.activestate.com/recipes/326576/
Regards,
Rusty
--
Rusty Conover
InfoGears Inc. / www.GearBuyer.com / www.FootwearBuyer.com
http://www.infogears.com