[Xapian-discuss] Ordering search results and defining a custom Weight class in python

Olly Betts olly at survex.com
Mon Jun 2 22:44:07 BST 2008


On Mon, Jun 02, 2008 at 12:19:58PM -0700, Robert Kaye wrote:
> I've asked for some help testing our new search service and that has  
> turned up that we're having problems properly tokenizing Chinese text.  
> Our database can conceivably have text from all languages supported by  
> Unicode and we'd need to find a way to properly tokenize Chinese text.
> I've seen a few posts from last year talking about a Chinese  
> tokenization scheme, but I haven't found anything about that in the  
> official docs.

The current state is that Chinese characters are treated as word
characters in much the same way as A-Z, 0-9, etc.  So a run of such
characters without spaces between them gets indexed as one long term,
which doesn't really work very well for searching.

I'd like to add support for a better indexing/searching approach for
Chinese (and other languages which work in a similar way).  Someone
provided some standalone code for tokenising Chinese (which is probably
what you were looking at in the archives), so it's mostly a matter of
integrating this, or using it as a model for implementing something
similar if it isn't a good fit.

> Is there a preferred way (in python) to handle the tokenization of  
> Chinese characters?

I'm not aware of one.

But a simple hack which might help for now is to insert a space between
any two adjacent Chinese characters before indexing or searching.
Particularly for the short "documents" you're looking at, that should
work pretty well.
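
For example, here's a rough Python sketch of that hack (untested, and
assuming the main CJK Unified Ideographs block U+4E00..U+9FFF is a close
enough approximation of "Chinese characters" for your data -- extend the
character class if you need the extension blocks):

    import re

    # Put a space after each character in the main CJK Unified
    # Ideographs block so each one becomes a separate term.
    _cjk = re.compile(u'([\u4e00-\u9fff])')

    def separate_cjk(text):
        return _cjk.sub(r'\1 ', text)

Run both the document text (before indexing it) and the query string
(before handing it to QueryParser) through the same function, so that
documents and queries are tokenised consistently.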

Cheers,
    Olly


