[Xapian-devel] How to add support of Chinese & Japanese

Olly Betts olly at survex.com
Thu Sep 1 14:35:50 BST 2011


On Thu, Jul 28, 2011 at 11:52:30AM +0800, Bruce Zhang wrote:
> According to online materials, Xapian is going to support CJK,
> so what's the current status of support for Chinese (simplified, traditional)?
> And what's the current status of support for Korean and Japanese respectively?

There's the n-gram approach (ticket#180) which should work for any of
these languages.  That's now merged to trunk and the 1.2 branch, but
you currently have to set an environment variable to enable it.
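
To give a rough idea of what an n-gram approach does, here's a minimal
Python sketch.  It is not the Xapian code (which is C++): the names
is_cjk() and cjk_ngrams() and the restricted set of code point ranges
are assumptions made purely for illustration.  It emits each CJK
character plus each overlapping pair of adjacent characters as terms,
so no dictionary is needed:

```python
def is_cjk(ch):
    """Rough check for a CJK character.

    Illustrative only: covers CJK Unified Ideographs, Hiragana,
    Katakana and Hangul syllables; a real implementation would
    handle more ranges.
    """
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul syllables


def cjk_ngrams(text, n=2):
    """Return unigrams and overlapping n-grams for each run of CJK text."""
    terms = []
    run = []
    for ch in text + "\0":  # the sentinel flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        # A non-CJK character ends the current run: emit its terms.
        for i in range(len(run)):
            terms.append(run[i])
            if i + n <= len(run):
                terms.append("".join(run[i:i + n]))
        run = []
    return terms
```

Because the same terms are generated at index and query time, a query
for a two-character word matches wherever those two characters are
adjacent in a document, whichever of the languages it is in.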

There's also the segmentation code for Chinese which Dai Youli has
been working on for GSoC, which we're hoping to get merged in fairly
soon too.

As far as I know, nobody has worked on adding specific support for
segmenting Japanese or Korean (there was a potential GSoC applicant
who was looking at Japanese, but they didn't apply in the end).

> I downloaded xapian-core-1.2.6 and xapian-omega-1.2.6, and I saw from the
> online documentation that the Chinese segmentation code is in a separate
> folder named segmentation.
>
> I wonder if the Chinese segmentation code is in 1.2.6, or is it still a
> beta release?

Neither of the approaches being worked on is in a release yet.

> How should I integrate the segmentation code with xapian-core-1.2.6 and
> xapian-omega-1.2.6?

It'll need a fair bit of work to integrate it.  The places you'd
want to hook in are similar to where the n-gram CJK code hooks in
if you want to look into this.

> What kind of dictionary is used? Who will maintain this dictionary to
> incorporate new words as they come into use?

There's a dictionary of common words, and another for names.

The code is built to cope with having an incomplete dictionary, so it
shouldn't be vital to keep updating it, though hopefully it will evolve
with time.
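
As an illustration of how a segmenter can cope with an incomplete
dictionary, here's a minimal Python sketch using greedy forward
maximum matching with a single-character fallback.  This is a
swapped-in textbook technique, not necessarily what the GSoC code
does, and the name segment() and the max_len parameter are
assumptions for the example:

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching with single-character fallback.

    `dictionary` is any container of known words supporting `in`.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            # No known word starts here: emit one character and move on,
            # so unknown words degrade gracefully instead of failing.
            words.append(text[i])
            i += 1
    return words
```

With a dictionary containing only "中文" and "分词", the input
"中文分词器" still segments: the two known words are found and the
unknown trailing character is emitted on its own.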

It may also be possible to use the term list from the database as a
dynamic wordlist.

Cheers,
    Olly


