[Xapian-discuss] using xapian for indexing mails [SOLVED]

Rusty Conover rconover at infogears.com
Mon Sep 1 10:57:29 BST 2008


>
> I noticed that the stemming is language-specific (understandably); is
> there some recommended way to guess the language of a blob of text?
> For me, speed is more important than 100% accuracy (which would be hard
> anyway, and consider multi-language text etc...)

N-gram analysis works pretty well for this.

In a nutshell it works like this:

Step 1. Training: Produce n-grams from sample texts in various
languages, and keep the N most frequent n-grams for each language,
where N is sufficiently large.
Step 2. Analysis: Count how many n-grams from the unknown text match
the stored n-grams for each language. The language with the most
matches is probably the language of that text.
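
To make the two steps concrete, here is a rough Python sketch. It is
only an illustration, not code from either of the libraries linked
below: the function names, the choice of character trigrams, the
top=300 cutoff and the toy sample sentences are all my own assumptions.
Real training text would need to be much larger than one sentence per
language.

```python
from collections import Counter

# Toy training text (an assumption for illustration only); a real
# profile needs far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt auf den faulen hund und "
          "die katze sitzt auf der matte mit den anderen tieren",
}


def ngrams(text, n=3):
    """Return the character n-grams of lowercased, space-normalised text."""
    cleaned = " ".join(text.lower().split())
    padded = " " + cleaned + " "  # pad so word boundaries form n-grams too
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3, top=300):
    """Step 1: keep the `top` most frequent n-grams for each language."""
    profiles = {}
    for lang, text in samples.items():
        counts = Counter(ngrams(text, n))
        profiles[lang] = {gram for gram, _ in counts.most_common(top)}
    return profiles


def guess_language(text, profiles, n=3):
    """Step 2: pick the language whose profile matches the most n-grams."""
    grams = ngrams(text, n)
    scores = {
        lang: sum(1 for gram in grams if gram in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(guess_language("the dog and the fox", profiles))
    print(guess_language("der hund und die katze", profiles))
```

Training happens once and each lookup is just set membership tests, so
this should be fast enough when speed matters more than accuracy. The
guessed language could then be used to pick the stemmer for that mail.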

See:
http://www.rubyinside.com/whatlanguage-ruby-language-detection-library-1085.html
http://code.activestate.com/recipes/326576/

Regards,

Rusty
--
Rusty Conover
InfoGears Inc. / www.GearBuyer.com / www.FootwearBuyer.com
http://www.infogears.com