[Xapian-discuss] Ordering search results and defining a custom Weight class in python

Olly Betts olly at survex.com
Mon Jun 2 22:44:07 BST 2008


On Mon, Jun 02, 2008 at 12:19:58PM -0700, Robert Kaye wrote:
> I've asked for some help testing our new search service and that has  
> turned up that we're having problems properly tokenizing Chinese text.  
> Our database can conceivably have text from all languages supported by  
> Unicode and we'd need to find a way to properly tokenize Chinese text.
> I've seen a few posts from last year talking about a Chinese  
> tokenization scheme, but I haven't found anything about that in the  
> official docs.

The current state is that Chinese characters are treated as word
characters in much the same way as A-Z, 0-9, etc.  So a run of such
characters without spaces between them gets indexed as one long term,
which doesn't really work very well for searching.

I'd like to add support for a better indexing/searching approach for
Chinese (and other languages which work in a similar way).  Someone
provided some standalone code for tokenising Chinese (which is probably
what you were looking at in the archives), so it's mostly a matter of
integrating this, or using it as a model for implementing something
similar if it isn't a good fit.

> Is there a preferred way (in python) to handle the tokenization of  
> Chinese characters?

I'm not aware of one.

But a simple hack which might help for now is to insert a space between
any two adjacent Chinese characters before indexing or searching.
Particularly for the short "documents" you're looking at, that should
work pretty well.
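
For example, here's a rough Python sketch of that hack (untested, and
assuming the main CJK Unified Ideographs block U+4E00..U+9FFF is a close
enough approximation of "Chinese characters" for your data -- extend the
character class if you need the extension blocks):

    import re

    # Put a space after each character in the main CJK Unified
    # Ideographs block so each one becomes a separate term.
    _cjk = re.compile(u'([\u4e00-\u9fff])')

    def separate_cjk(text):
        return _cjk.sub(r'\1 ', text)

Run both the document text (before indexing it) and the query string
(before handing it to QueryParser) through the same function, so that
documents and queries are tokenised consistently.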

Cheers,
    Olly


