[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
henearkrxern at gmail.com
Fri Jun 29 03:15:53 BST 2007
A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it. Thanks.
Best,
Yung-chung Lin
On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> Hi,
>
> I am looking for a Chinese, Japanese, and Korean tokenizer that can
> be used to tokenize terms for CJK languages. I am not very familiar
> with these languages, but I understand that a single symbol can
> represent one or more words, which makes it more difficult to
> tokenize text into searchable terms.
>
> Lucene has a CJK Tokenizer ... and I am looking around to see if
> there is an open-source one that we could use with Xapian.
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
>
> Cheers
> -Kevin Duraj
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
-------------- next part --------------
#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <string>
#include <vector>

#include <unicode.h>

namespace cjk {

// Tokenization strategy: overlapping bigrams (default) or single
// characters (unigram).
enum tokenizer_type {
    TOKENIZER_DEFAULT,
    TOKENIZER_UNIGRAM
};

class tokenizer {
  private:
    enum tokenizer_type _type;
    inline void _convert_unicode_to_char(unicode_char_t &uchar,
                                         unsigned char *p);

  public:
    tokenizer();
    tokenizer(enum tokenizer_type type);
    ~tokenizer();
    // Append the tokens found in the input to token_list.
    void tokenize(std::string &str,
                  std::vector<std::string> &token_list);
    void tokenize(char *buf, size_t buf_len,
                  std::vector<std::string> &token_list);
    // Split the input on token boundaries without generating n-grams.
    void split(std::string &str,
               std::vector<std::string> &token_list);
    void split(char *buf, size_t buf_len,
               std::vector<std::string> &token_list);
};

} // namespace cjk

#endif
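The header above only declares the interface; the implementation was in the rest of the attachment. As a rough illustration of the bigram approach (this is my own minimal sketch, not Yung-chung Lin's attached code, and it assumes well-formed UTF-8 input and a simplified notion of which code-point ranges count as CJK), a bigram tokenizer might look like this:

```cpp
#include <string>
#include <vector>

// Decode one UTF-8 code point starting at s[i]; advances i past it.
// Minimal decoder: assumes well-formed 1-4 byte sequences.
static unsigned decode_utf8(const std::string &s, size_t &i) {
    unsigned char c = s[i];
    unsigned cp;
    int extra;
    if (c < 0x80)              { cp = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else                       { cp = c & 0x07; extra = 3; }
    ++i;
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (s[i++] & 0x3F);
    return cp;
}

// Simplified CJK test: ideographs, kana, and Hangul syllables only.
static bool is_cjk(unsigned cp) {
    return (cp >= 0x2E80 && cp <= 0x9FFF) ||   // radicals + ideographs
           (cp >= 0xF900 && cp <= 0xFAFF) ||   // compatibility ideographs
           (cp >= 0xAC00 && cp <= 0xD7AF);     // Hangul syllables
}

// Emit overlapping character bigrams for runs of CJK text, and
// whitespace-delimited words for everything else.
std::vector<std::string> bigram_tokenize(const std::string &s) {
    std::vector<std::string> tokens;
    std::vector<std::string> run;   // current run of CJK characters
    std::string word;               // current non-CJK word
    auto flush_run = [&]() {
        if (run.size() == 1)
            tokens.push_back(run[0]);           // lone char: emit as-is
        for (size_t i = 0; i + 1 < run.size(); ++i)
            tokens.push_back(run[i] + run[i + 1]);
        run.clear();
    };
    size_t i = 0;
    while (i < s.size()) {
        size_t start = i;
        unsigned cp = decode_utf8(s, i);
        std::string ch = s.substr(start, i - start);
        if (is_cjk(cp)) {
            if (!word.empty()) { tokens.push_back(word); word.clear(); }
            run.push_back(ch);
        } else if (cp == ' ' || cp == '\t' || cp == '\n') {
            flush_run();
            if (!word.empty()) { tokens.push_back(word); word.clear(); }
        } else {
            flush_run();
            word += ch;
        }
    }
    flush_run();
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}
```

The point of emitting overlapping bigrams is that a query is bigrammed the same way, so a multi-character phrase still matches as a sequence of shared bigrams without needing a dictionary-based word segmenter.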