[Xapian-discuss] Integrated Chinese tokenizer SCWS in xapian-core

Olly Betts olly at survex.com
Thu Sep 15 06:45:07 BST 2011


On Wed, Sep 14, 2011 at 01:40:25PM +0800, hightman wrote:
> Xapian is an excellent open source search engine library, but
> there is no native support for Chinese word segmentation in
> queryparser and termgenerator.

Actually, trunk now has code for an n-gram based approach, and there is
a GSoC project which has been working on adding support for segmentation
using dictionaries and other heuristics, but there is certainly room for
supporting multiple alternative approaches.
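
For readers unfamiliar with the term, here is a minimal sketch of what an
n-gram based approach means in general: each run of CJK characters is
split into overlapping bigrams, so no dictionary is needed. This is
not the code on trunk; the function names and the single Unicode range
checked are assumptions made purely for illustration.

```python
def is_cjk(ch):
    """Rough check: CJK Unified Ideographs block only (a simplification)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Return overlapping bigrams for each run of CJK characters.

    A run of length one is emitted as a single-character token.
    Non-CJK characters only act as run separators in this sketch.
    """
    tokens = []
    run = []
    for ch in text + " ":  # trailing sentinel flushes the last run
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            tokens.append(run[0])
        else:
            # Empty runs produce nothing; longer runs yield len(run) - 1 bigrams.
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run = []
    return tokens
```

A dictionary-based segmenter would instead try to emit real words, which
gives fewer and more meaningful terms at the cost of needing and
maintaining a dictionary.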

> Therefore, I modified a small amount of source code to integrate the
> SCWS tokenizer, which is also open source and was developed by
> myself.

What licence is SCWS released under?  I couldn't find this information
anywhere - the nearest I came was the COPYING file in the distribution.
I tried converting this from BIG-5 to UTF-8, which gave plausible
looking Chinese text, but Google translate just gave gibberish when I
tried to convert the UTF-8 text to English to get the gist.

> Anyone can obtain the patch from the URL below. After patching,
> Xapian::QueryParser::parse_query and Xapian::TermGenerator::index_text
> will support Chinese word segmentation directly.
> 
> https://github.com/hightman/xunsearch/blob/master/xapian-scws/patch.xapian-core-scws

Thanks for the patch.

If you want to get this integrated into Xapian releases, we really need
a patch against trunk (this one won't apply cleanly, since it hooks in
to the same places as the new n-gram CJK code).

We also really need test coverage for the added code, so we know that
it actually works and to help ensure it isn't broken by future changes.

Also, please confirm that you're happy to license the patch suitably -
see "Licensing of patches" in HACKING:

http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L1203

Cheers,
    Olly


