[Xapian-devel] Need Suggestions for Sentence Breaking Implementation

Fri Jul 13 13:39:38 BST 2012

Hi,

I have been working on a Link Grammar interface so that POS tagging can be
used while indexing documents.

The interface header and the implementation file are complete. You can view
them at
<https://github.com/sehaj-sk/xapian/commit/052d634e1986bcf5607e43f52ac3e07646920196>
and
<https://github.com/sehaj-sk/xapian/commit/3015223662986d7a180d77101d6f4664f6552144>
respectively.

After that, I tried using the interface for indexing documents. Here's the
code that uses the Link Grammar interface for POS-tagged indexing in
termgenerator:
<https://github.com/sehaj-sk/xapian/commit/4ed8e505b44581fcc038598ec0b7cd011e42f8da>

I have added a simple example in the xapian-core/examples directory, that
shows the outcome and results of this feature. The example is present at <
https://github.com/sehaj-sk/xapian/commit/75c2e4749e9084fca5f390b88d565cb117e90d38>

At present it can index only single sentences, so to index a larger text I
first need to break it into sentences.
I would like suggestions for doing this Sentence Boundary Disambiguation.

Please suggest any paper/algorithm that could be coded or any existing
library that can be used.
The focus at present is on English language only.

I have done some searching and here's what I have found -

1. Here's a Wikipedia article that discusses the problem and the available
solutions. < http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation >
2. There are not many available solutions in C/C++. Almost all of them are
either in Python or Java.
3. There's a sentence boundary detection algorithm defined by the Unicode
Standard. It's described at <
http://www.unicode.org/reports/tr29/#Sentence%5FBoundaries >
4. An existing C++ API that does this is the BreakIterator class, documented
here: < http://icu-project.org/apiref/icu4c/classBreakIterator.html > .
Here's a line from its documentation:  "The text boundary positions are found
according to the rules described in Unicode Standard Annex #29, Text
Boundaries, and Unicode Standard Annex #14, Line Breaking Properties. These
are available at < http://www.unicode.org/reports/tr14/ > and <
http://www.unicode.org/reports/tr29/ > ."
5. Someone suggested using the PCRE (Perl Compatible Regular Expressions)
library < http://www.pcre.org/ > (though I don't know much about Perl) to
write Perl-style regular expressions from C++. The Wikipedia page above
mentions some Perl-compatible regular expressions for sentence breaking.
Other than that, someone also suggested taking this Perl module <
http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm>
and porting it to C++ using PCRE.
6. There are also a lot of research papers written for this problem.
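
To make the rule-based idea from point 5 concrete, here is a minimal sketch
in plain standard C++. It does not use PCRE, and it is not the algorithm of
Lingua::Sentence or of UAX #29. The function name split_sentences and the
small abbreviation list are hypothetical, chosen only for illustration. The
rule assumed here: a '.', '!' or '?' ends a sentence when it is followed by
whitespace and an uppercase letter (or by the end of the text), unless the
word before a '.' is a known abbreviation.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Naive rule-based sentence splitter (illustration only, English/ASCII).
// The abbreviation list is a hypothetical sample, far from complete.
std::vector<std::string> split_sentences(const std::string& text) {
    static const std::set<std::string> abbreviations = {
        "Mr", "Mrs", "Ms", "Dr", "Prof", "etc", "e.g", "i.e"
    };
    std::vector<std::string> sentences;
    std::string::size_type start = 0;  // start of the current sentence

    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;

        // Skip whitespace after the candidate terminator.
        std::string::size_type j = i + 1;
        while (j < text.size() &&
               std::isspace(static_cast<unsigned char>(text[j]))) ++j;

        // No whitespace after the terminator (e.g. "3.14"): not a boundary.
        if (j == i + 1 && j < text.size()) continue;
        // Next word does not start with an uppercase letter: not a boundary.
        if (j < text.size() &&
            !std::isupper(static_cast<unsigned char>(text[j]))) continue;

        if (c == '.') {
            // Find the word directly before the period.
            std::string::size_type w = i;
            while (w > start &&
                   !std::isspace(static_cast<unsigned char>(text[w - 1]))) --w;
            if (abbreviations.count(text.substr(w, i - w))) continue;
        }

        sentences.push_back(text.substr(start, i + 1 - start));
        start = j;
        i = j - 1;  // resume scanning at the start of the next sentence
    }

    // Keep any trailing text that has no terminator.
    if (start < text.size()) sentences.push_back(text.substr(start));
    return sentences;
}
```

For the input "Dr. Smith went home. He slept! Was it 3.14 units? Yes." this
yields four sentences, and "Dr." and "3.14" are not treated as boundaries.
Each returned string could then be passed to the single-sentence indexing
code one at a time. A heuristic like this will still get many real cases
wrong (initials, quotations, abbreviations that end a sentence), which is
where the approaches in points 3 and 4 are likely to do better.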

Looking forward to quick guidance and suggestions.

Thanks,
Sehaj