[Xapian-discuss] xapian supports Chinese language

LiYong sdliyong at gmail.com
Wed Apr 8 14:19:55 BST 2009


----- Original Message ----- 
From: "Olly Betts" <olly at survex.com>
To: "Li Yong" <sdliyong at gmail.com>
Cc: <xapian-discuss at lists.xapian.org>
Sent: Wednesday, April 08, 2009 7:36 PM
Subject: Re: [Xapian-discuss] xapian supports Chinese language


> On Wed, Apr 08, 2009 at 05:08:31PM +0800, Li Yong wrote:
>> I want to use xapian to index chinese html pages.
>>
>> I found the cjk-tokenizer lib in the maillist
>> http://lists.tartarus.org/pipermail/xapian-discuss/2007-June/003921.html
>>
>> However, I do not know how to add this lib to the xapian project.
>
> That's just a link to the one in Lucene.
>
> This one might be more useful:
>
> http://thread.gmane.org/gmane.comp.search.xapian.general/4574/focus=4762
>
>> Is there any example or steps?
>
> I've not tried to use it myself.
>
> The longer term plan is to include this or something similar in Xapian
> itself, but nobody is currently working on it as far as I know.
>
> For now, I think you'd have to just ignore Xapian::TermGenerator and
> Xapian::QueryParser and add the bigram terms with add_posting() when
> indexing and combine them into queries with OP_AND.
>
> Cheers,
>    Olly
>

Thank you for your mail.

I have read the posts at
http://thread.gmane.org/gmane.comp.search.xapian.general/4574/focus=4762

Since I am new to Xapian, I read simpleindex.cc and simplesearch.cc in the 
examples folder and built the Omega application as a test. I want to modify 
Omega to index HTML pages that contain mixed Chinese and English text.



First I want to modify the simple examples. As I understand it, 
simpleindex.cc uses the index_text function to add the input. However, I do 
not know how it splits the input into words. For example, if the input is 
"This is a test", the API should add the terms this, is, a and test. Does it 
split on spaces? Chinese sentences do not contain spaces, so I want to use 
the cjk lib to split the mixed Chinese and English input. If I use the 
add_posting function, I have to split the input with the cjk lib and then 
pass each split word to add_posting.
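
A minimal sketch of the splitting step described above, assuming UTF-8 input 
and a C++11 compiler. It is not the cjk-tokenizer lib and does not show how 
index_text works; tokenize_mixed and next_codepoint are hypothetical helper 
names. ASCII letters and digits are grouped into lowercased words, and each 
run of characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF, an 
assumed range that ignores other CJK blocks) is emitted as overlapping 
bigrams, along the lines Olly suggests.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Assumes well-formed UTF-8; an unexpected lead byte is returned as-is.
static unsigned long next_codepoint(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char c = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    unsigned long cp = c;
    if (c < 0x80) return cp;
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else return cp;
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Split mixed Chinese/English text into terms:
//  - ASCII alphanumeric runs become lowercased words
//  - CJK runs become overlapping bigrams (a lone CJK character is kept as-is)
//  - everything else (spaces, punctuation) only separates terms
std::vector<std::string> tokenize_mixed(const std::string& text) {
    std::vector<std::string> terms;
    std::string word;         // ASCII word being built
    std::string prev_cjk;     // previous character of the current CJK run
    std::size_t run_len = 0;  // length of the current CJK run

    auto flush_word = [&] {
        if (!word.empty()) { terms.push_back(word); word.clear(); }
    };
    auto end_run = [&] {
        if (run_len == 1) terms.push_back(prev_cjk);
        prev_cjk.clear();
        run_len = 0;
    };

    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t start = i;
        unsigned long cp = next_codepoint(text, i);
        if (cp < 0x80 && std::isalnum(static_cast<unsigned char>(cp))) {
            end_run();
            word += static_cast<char>(
                std::tolower(static_cast<unsigned char>(cp)));
        } else if (cp >= 0x4E00 && cp <= 0x9FFF) {
            flush_word();
            std::string ch = text.substr(start, i - start);
            if (run_len > 0) terms.push_back(prev_cjk + ch);
            prev_cjk = ch;
            ++run_len;
        } else {
            flush_word();
            end_run();
        }
    }
    flush_word();
    end_run();
    return terms;
}
```

When indexing, each returned term could be passed to add_posting() on the 
Xapian::Document with an increasing position; when searching, the terms from 
the query string could be combined with OP_AND.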



If the input is in HTML format, I also have to parse the HTML tags. Can I 
modify some functions in the Omega application to do this?
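
For the HTML case, here is a deliberately naive sketch of tag removal, 
assuming C++11. strip_tags is a hypothetical helper, not part of Xapian or 
Omega. It does not decode entities, skip script or style content, or handle 
a '>' inside an attribute value, so a real HTML parser would still be needed 
in practice.

```cpp
#include <string>

// Replace every <...> tag with a single space and keep the text between tags.
// The space stops words on either side of a tag from running together.
std::string strip_tags(const std::string& html) {
    std::string out;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') {
            in_tag = true;
            out += ' ';
        } else if (c == '>' && in_tag) {
            in_tag = false;
        } else if (!in_tag) {
            out += c;
        }
    }
    return out;
}
```

The resulting plain text could then be split into terms before indexing.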



Sorry for the long mail. These are just my ideas, so please correct me if I 
am wrong.



Thank you again!




Li Yong







