[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator
Xapian
nobody at xapian.org
Fri Sep 25 11:01:02 BST 2009
#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: enhancement | Status: assigned
Priority: high | Milestone: 1.2.0
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Old description:
> Some code to do this kind of tokenisation is now available at
> http://code.google.com/p/cjk-tokenizer/ which should probably be used as
> the basis for supporting this in Xapian.
>
> We could add this as a QueryParser/TermGenerator option without breaking
> API compatibility. Marking for considering later in 1.1.x, but it could
> probably go in 1.2.x as it's likely to be ABI compatible too.
New description:
Some code to do this kind of tokenisation is now available at
http://code.google.com/p/cjk-tokenizer/ which should probably be used as
the basis for supporting this in Xapian.

We could add this as a !QueryParser/!TermGenerator option without breaking
API compatibility. Marking for considering later in 1.1.x, but it could
probably go in 1.2.x as it's likely to be ABI compatible too.
--
Comment(by olly):
Thanks for the patch - certainly a step forward.
There seem to be quite a lot of whitespace changes which make it harder to
read. Can you regenerate it adding {{{-bB}}} to the diff options?
The new header shouldn't be under "include/xapian", since that's for the
installed public API headers, but that's easy enough to fix.
It would be better to use Xapian's Unicode and UTF-8 support rather than
adding a dependency on glib. Not just because adding avoidable
dependencies is generally better, but also because there's scope for
getting confused results if glib and Xapian's routines give different
answers (as they might legitimately do if they are supporting different
Unicode versions, or if invalid UTF-8 sequences are encountered).
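
To make the invalid-UTF-8 concern concrete, here is a small sketch in Python. It does not call glib or Xapian; the two standard `errors=` policies of Python's decoder merely stand in for two libraries that handle a bad byte differently (that pairing is an assumption for illustration only). The same input bytes end up producing different terms.

```python
# Illustration only: two decoding policies standing in for two libraries.
# Neither is claimed to match what glib or Xapian actually does.

# "caf" + a stray 0xE9 byte (not valid UTF-8 here) + space + U+4E2D in UTF-8.
RAW = b"caf\xe9 \xe4\xb8\xad"


def terms_with_policy(raw, policy):
    """Decode raw bytes with the given error policy and split on whitespace."""
    return raw.decode("utf-8", errors=policy).split()


# Policy A substitutes U+FFFD for the bad byte; policy B silently drops it.
terms_a = terms_with_policy(RAW, "replace")
terms_b = terms_with_policy(RAW, "ignore")

# The first term differs between the two policies, so an index built with
# one and queried with the other would not match on it.
mismatch = terms_a[0] != terms_b[0]
```

If one library decided where the CJKV runs are and another generated the terms, a disagreement of this kind could put term boundaries in different places.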
I think it's probably better to have the user select "CJKV-mode".
Exploding every string being indexed into a vector and then scanning it to
see if CJKV characters are present is going to add a lot of overhead to
everyone, even those indexing non-CJKV text. It also seems we don't want
to completely change how we index (e.g.) English text which has a Chinese
name in it. Alternatively, we could perhaps switch mode within a text string when
we hit CJKV, and switch back when we hit non-CJKV.
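
The mode-switching idea can be sketched as follows, in Python rather than Xapian's C++ so that it stays self-contained. Everything here is hypothetical: the `CJK_RANGES` table is deliberately incomplete, the function names are invented for the example, and the use of overlapping bigrams for CJKV runs is an assumption (a common approach, but not something this ticket specifies).

```python
import re
from itertools import groupby

# Hypothetical and incomplete table of CJK code point ranges.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(ch):
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def split_runs(text):
    """Yield (is_cjk, run) pairs for maximal runs of CJK or non-CJK text."""
    for flag, chars in groupby(text, key=is_cjk):
        yield flag, "".join(chars)


def cjk_bigrams(run):
    """Return overlapping bigrams; a single character is kept as-is."""
    if len(run) == 1:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def tokenize(text):
    """Switch mode per run: bigrams for CJK, lowercased words otherwise."""
    tokens = []
    for flag, run in split_runs(text):
        if flag:
            tokens.extend(cjk_bigrams(run))
        else:
            tokens.extend(w.lower() for w in re.findall(r"\w+", run))
    return tokens


tokens = tokenize("Written by 王小明 in Beijing")
```

The text is scanned once and the English words are tokenised as before, so only the embedded CJKV run is treated differently. A real implementation would still need to decide how stemming, positions and prefixes apply to each kind of run.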
--
Ticket URL: <http://trac.xapian.org/ticket/180#comment:8>
Xapian <http://xapian.org/>