[Xapian-devel] My Introduction and Ideas

Olly Betts olly at survex.com
Wed Feb 26 12:18:41 GMT 2014


On Wed, Feb 26, 2014 at 09:31:53AM +0800, Chu Bingxiang wrote:
> I have been focused on this project since January. Recently, searching
> the Internet, I found that there seem to be few Chinese documents,
> and for Chinese, it doesn't work as well as for other languages. From
> the project website, I learned that there was a Chinese student who did
> a Chinese segmentation analysis at GSoC 2011. Maybe it doesn't work
> well, so Chinese users preferred to use another segmentation module
> after it was published.

Yes, in 2011 Dai Youli worked on implementing a segmentation algorithm
for Chinese.  We did a fairly thorough search first, and failed to find
anything existing in C or C++ which had a suitable licence, so writing
one seemed the only option.  It's quite a challenging project for the
GSoC timescale - it basically worked at the end, but there was quite a
bit more to do.

But shortly after that, someone submitted a patch for integration of
the SCWS Chinese segmentation library - this had been relicensed
under a more liberal licence since we'd looked for something suitable,
so we hadn't considered it before:

http://www.xunsearch.com/scws/

This is a working segmentation algorithm (or at least I'm told it is - I
don't understand Chinese well enough to tell for myself) and it's
actively maintained, which certainly beats having to maintain our own.

We've not managed to get the patch merged yet.  I did start to work
on cleaning it up, but then the author sent an updated version of the
patch which wasn't based on my cleaned up version of the original,
and that rather put me off working on it for a while.  Sadly I've not
yet got back to it.

The original patch and my cleaned up version are in this ticket,
and I've just tracked down the newer patch and added that too:

http://trac.xapian.org/ticket/594

> As for project ideas, I think rebuilding or improving that Chinese
> segmentation analysis module would be a good one.

The current status is far from ideal, and it would be good to move
things forwards.

We left off the "Improve Chinese Support" idea from the list this
year because there didn't seem to be enough to occupy a student for
3 months.  We need to merge the changes from the newer patch and
my cleaned up version of the older one, integrate everything nicely,
and make sure there's test coverage and documentation for it.

But if you'd like to work on that, you could combine it with something
else unrelated to make a suitably sized project - there's no reason the
project has to be all one thing.

> Also, translating the docs could be a good choice from now on to
> increase the influence of Xapian in China.

Documentation in languages other than English would be great to have,
but translating documentation doesn't really fit with GSoC's rules:

http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page#12._Are_proposals_for_documentation_work

But if you (or anyone else) wants to work on translations outside of
GSoC, I'd suggest the newer "Getting Started with Xapian" guide would be
the best document to work on:

http://getting-started-with-xapian.readthedocs.org/en/latest/

Cheers,
    Olly


