<div>Hi there,</div><div><br></div><div>My name is Julia Wilson and I'm a grad student in Computational Linguistics at Brandeis University. As a GSoC project I'm interested in improving Japanese language support, and I have a couple of questions for the application I'm putting together.</div>
<div><br></div><div>I know Japanese - I'm not a native speaker by any means, but I'm pretty good - and I'm really interested in the specifics of how dealing with Japanese text differs from dealing with English and other languages that use Roman scripts. Currently I'm doing some research on improving Japanese-to-English translation algorithms using linguistic data (topicalization and pronoun resolution, specifically).</div>
<div><br></div><div>The suggested project description mentions switching to a language-specific segmentation algorithm; since a particular algorithm isn't specified, I'm guessing that the evaluation and selection of an algorithm would necessarily be part of the project? I've worked with Japanese text mostly in Python, so I've used <a href="https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py" target="_blank">TinySegmenter in Python</a> when I needed segmentation; I've run across a couple of other options that are available in C++, but I haven't really had occasion to use them. So basically, I've dealt with Japanese segmenters before, but I'm interested to know whether you have anything specific in mind.</div>
<div><br></div><div>I've been looking at the code in xapian-core/languages and xapian-core/queryparser to get an idea of how other languages are implemented and used. Am I correct in assuming that the ultimate goal is to deprecate the n-gram-based tokenizer in favor of individual ways of handling Japanese and Korean (and presumably Chinese)? That is, is the idea to make language support entirely language-specific, or would there still be kind of a generic "CJK Stuff" class as well? Also, is there anything else beyond what I mentioned that I should make a point of looking at in the code base to better understand what would need to be done?</div>
<div><br></div><div>Thanks,</div><div>Julia</div>