[Xapian-devel] GSoC - Improve Japanese Support
Julia Wilson
jlwilson at brandeis.edu
Thu Mar 29 02:51:50 BST 2012
Hi there,
My name is Julia Wilson and I'm a grad student in Computational Linguistics
at Brandeis University. As a GSoC project I'm interested in improving
Japanese language support, and I had a couple of questions for the
application I'm putting together.
I know Japanese - I'm not a native speaker by any means, but I'm pretty
good - and I'm really interested in the specifics of how dealing with
Japanese text differs from dealing with English and other languages that
use the Roman script. Currently I'm doing some research on improving Japanese
to English translation algorithms using linguistic data (topicalization and
pronoun resolution, specifically).
The suggested project description mentions switching to a language-specific
segmentation algorithm. Since no particular algorithm is specified, I'm
guessing that evaluating and selecting one would be part of the project?
I've worked with Japanese text mostly
in Python, and so I've used TinySegmenter in Python
<https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py>
when I needed segmentation; I've run across a couple of other options that were
available in C++ but I haven't really had occasion to use them. So
basically, I've dealt with Japanese segmenters before, but am interested to
know if you had anything specific in mind.
I've been looking at the code in xapian-core/languages and
xapian-core/queryparser to get an idea of how other languages are
implemented and used. Am I correct in assuming that the ultimate goal is to
deprecate the n-gram based tokenizer in favor of individual ways of
handling Japanese and Korean (and presumably Chinese)? That is, is the idea
to make language support entirely language-specific, or would there still
be kind of a generic "CJK Stuff" class as well? Also, is there anything
else beyond what I mentioned that I should make a point of looking at in
the code base to better understand what would need to be done?
Thanks,
Julia