[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern")
henearkrxern at gmail.com
Fri Jun 29 03:15:53 BST 2007
A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it. Thanks.
Best,
Yung-chung Lin
On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> Hi,
>
> I am looking for a Chinese, Japanese, and Korean tokenizer that can
> be used to tokenize terms for CJK languages. I am not very familiar
> with these languages, but I understand that a single symbol can
> represent one or more words, which makes it more difficult to
> tokenize text into searchable terms.
>
> Lucene has a CJK Tokenizer ... and I am looking around to see if
> there is an open-source one that we could use with Xapian.
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
>
> Cheers
> -Kevin Duraj
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
-------------- next part --------------
#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <string>
#include <vector>

#include <unicode.h>

namespace cjk {

// Tokenization strategy: overlapping bigrams (default) or single
// characters (unigram).
enum tokenizer_type {
    TOKENIZER_DEFAULT,
    TOKENIZER_UNIGRAM
};

class tokenizer {
  private:
    enum tokenizer_type _type;
    inline void _convert_unicode_to_char(unicode_char_t &uchar,
                                         unsigned char *p);

  public:
    tokenizer();
    tokenizer(enum tokenizer_type type);
    ~tokenizer();
    // Append the tokens found in the input to token_list.
    void tokenize(std::string &str,
                  std::vector<std::string> &token_list);
    void tokenize(char *buf, size_t buf_len,
                  std::vector<std::string> &token_list);
    // Split the input on token boundaries without generating n-grams.
    void split(std::string &str,
               std::vector<std::string> &token_list);
    void split(char *buf, size_t buf_len,
               std::vector<std::string> &token_list);
};

} // namespace cjk

#endif
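The header above only declares the interface; the implementation was in the rest of the attachment. As a rough illustration of the bigram approach (this is my own minimal sketch, not Yung-chung Lin's attached code, and it assumes well-formed UTF-8 input and a simplified notion of which code-point ranges count as CJK), a bigram tokenizer might look like this:

```cpp
#include <string>
#include <vector>

// Decode one UTF-8 code point starting at s[i]; advances i past it.
// Minimal decoder: assumes well-formed 1-4 byte sequences.
static unsigned decode_utf8(const std::string &s, size_t &i) {
    unsigned char c = s[i];
    unsigned cp;
    int extra;
    if (c < 0x80)              { cp = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else                       { cp = c & 0x07; extra = 3; }
    ++i;
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (s[i++] & 0x3F);
    return cp;
}

// Simplified CJK test: ideographs, kana, and Hangul syllables only.
static bool is_cjk(unsigned cp) {
    return (cp >= 0x2E80 && cp <= 0x9FFF) ||   // radicals + ideographs
           (cp >= 0xF900 && cp <= 0xFAFF) ||   // compatibility ideographs
           (cp >= 0xAC00 && cp <= 0xD7AF);     // Hangul syllables
}

// Emit overlapping character bigrams for runs of CJK text, and
// whitespace-delimited words for everything else.
std::vector<std::string> bigram_tokenize(const std::string &s) {
    std::vector<std::string> tokens;
    std::vector<std::string> run;   // current run of CJK characters
    std::string word;               // current non-CJK word
    auto flush_run = [&]() {
        if (run.size() == 1)
            tokens.push_back(run[0]);           // lone char: emit as-is
        for (size_t i = 0; i + 1 < run.size(); ++i)
            tokens.push_back(run[i] + run[i + 1]);
        run.clear();
    };
    size_t i = 0;
    while (i < s.size()) {
        size_t start = i;
        unsigned cp = decode_utf8(s, i);
        std::string ch = s.substr(start, i - start);
        if (is_cjk(cp)) {
            if (!word.empty()) { tokens.push_back(word); word.clear(); }
            run.push_back(ch);
        } else if (cp == ' ' || cp == '\t' || cp == '\n') {
            flush_run();
            if (!word.empty()) { tokens.push_back(word); word.clear(); }
        } else {
            flush_run();
            word += ch;
        }
    }
    flush_run();
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}
```

The point of emitting overlapping bigrams is that a query is bigrammed the same way, so a multi-character phrase still matches as a sequence of shared bigrams without needing a dictionary-based word segmenter.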