[Xapian-discuss] Re: Japanese / UTF-8 support

Fabrice Colin fabrice.colin at gmail.com
Fri Aug 11 12:23:50 BST 2006


On 8/11/06, "Jeff Breidenbach" <breidenbach at gmail.com> wrote:
> > And what about automatic language detection?
> > That would help me tremendously too, as I have about 60% English, 20%
> > German, 5% French, 5% Korean, 5% Japanese, and 5% Italian.
>
> Not sure why this is useful, except perhaps for stemming. Even then
> you will be in trouble for mixed language documents. Seems a little
> outside the scope of Xapian, at least from my newbie perspective.
> Anyway, there are a couple of n-gram based language detectors in open
> source land which work fairly well, but the error rate is noticeable.
>
I am using libtextcat (http://software.wise-guys.nl/libtextcat/) for Pinot.
It's pretty accurate, at least with the few European languages I tried.
Korean and Japanese are supported too apparently...
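
For anyone unfamiliar with how n-gram based detectors work, here is a
minimal Python sketch of the rank-order ("out-of-place") comparison of
character n-gram profiles that such tools commonly use. It is not
libtextcat's code or API: the function names, the tiny training sentences
and the profile size are all made up for illustration, and a real detector
would train on far more text per language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def detect(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data, one sentence per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog runs away from the fox",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann läuft der hund vor dem fuchs weg",
    "french": "le renard brun rapide saute par dessus le chien paresseux "
              "et puis le chien s'enfuit loin du renard",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(detect("the dog and the fox", PROFILES))
```

With profiles this small, only input close to the training sentences is
classified reliably. Short or mixed-language text is where the error rate
mentioned above tends to show up.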

Fabrice


