[Xapian-discuss] n-gram / cjk serializer

Fabrice Colin fabrice.colin at gmail.com
Wed Aug 20 13:14:19 BST 2008


On Wed, Aug 20, 2008 at 7:00 PM, Joss Shaw <jossblowing at yahoo.co.uk> wrote:
> I've been trawling through the archives and I found reference to an n-gram query parser plugin
> which some guy made.  I don't think it's been included into the main Xapian distro yet but I would
> be really interested in such a tokenizer if there were plans!
>
> His tokenizer apparently plugs into Xapian, but I'm not sure how you plug extra query parsing
> engines in - could someone possibly shed some light on this for me please?  Additionally,
> would any plugin be able to take advantage of the term prefixes? Or is that something that would
> need to be reimplemented with each query parsing / tokenizing engine ?
>
> The guy put all the code here: http://code.google.com/p/cjk-tokenizer/
>
The guy in question is Yung-Chung Lin.

I am using a slightly modified version of his CJKV tokenizer in Pinot
to pre-process queries before feeding them to the QueryParser. I chose
this route because I didn't want to implement my own query parser and
wanted something that works with "mixed" queries.

Look for the QueryModifier class here:
http://svn.berlios.de/wsvn/pinot/trunk/IndexSearch/Xapian/XapianEngine.cpp
The CJKVTokenizer class is here:
http://svn.berlios.de/wsvn/dijon/trunk/cjkv/CJKVTokenizer.cc

For instance, the query "你身体好吗  title:妈妈" is rewritten as:
(你 你身 身 身体 体 体好 好 好吗 吗) title:妈 title:妈妈 title:妈
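
The rewriting shown above can be sketched in a few lines of Python. This
is only an illustration of the unigram-plus-bigram idea, not the code from
CJKVTokenizer or QueryModifier: the function names (is_cjk,
interleaved_ngrams, expand_query) are hypothetical, and the check for CJK
text is a deliberate simplification that only looks at the CJK Unified
Ideographs block.

```python
def is_cjk(ch):
    # Simplifying assumption: only U+4E00..U+9FFF counts as CJK here.
    return "\u4e00" <= ch <= "\u9fff"


def interleaved_ngrams(text, max_n=2):
    # For each start position, emit the unigram, then the bigram (if any).
    tokens = []
    for i in range(len(text)):
        for n in range(1, max_n + 1):
            if i + n <= len(text):
                tokens.append(text[i:i + n])
    return tokens


def expand_query(query):
    # Rewrite whitespace-separated words; non-CJK words pass through as-is.
    parts = []
    for word in query.split():
        # "title:妈妈" -> ("title", ":", "妈妈"); no colon -> ("", "", word)
        prefix, sep, term = word.rpartition(":")
        if not term or not all(is_cjk(c) for c in term):
            parts.append(word)
            continue
        tokens = interleaved_ngrams(term)
        if prefix:
            # Repeat the prefix on every generated token.
            parts.extend(prefix + sep + t for t in tokens)
        else:
            # Group the tokens of an unprefixed word in parentheses.
            parts.append("(" + " ".join(tokens) + ")")
    return " ".join(parts)


print(expand_query("你身体好吗  title:妈妈"))
# (你 你身 身 身体 体 体好 好 好吗 吗) title:妈 title:妈妈 title:妈
```

The resulting string is what would then be handed to the QueryParser, so
prefixes and non-CJK words in a "mixed" query keep their usual meaning.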

Altogether it seems to work quite well. Of course, any bugs are mine, not
Yung-Chung's :-)

I hope this helps.

Fabrice

