[Xapian-devel] Need Suggestions for Sentence Breaking Implementation
Olly Betts
olly at survex.com
Mon Jul 16 04:56:08 BST 2012
On Fri, Jul 13, 2012 at 06:09:38PM +0530, Sehaj Singh Kalra wrote:
> 4. An existing C++ API that does this is the BreakIterator class, documented
> here: < http://icu-project.org/apiref/icu4c/classBreakIterator.html > .
> Here's a line from its doc: "The text boundary positions are found
> according to the rules described in Unicode Standard Annex #29, Text
> Boundaries, and Unicode Standard Annex #14, Line Breaking Properties. These
> are available at < http://www.unicode.org/reports/tr14/ > and <
> http://www.unicode.org/reports/tr29/ > ."
ICU is rather a big dependency, but this is probably a good choice for
initial development as it means you can get on with the indexing part,
rather than spending time coding up the same Unicode algorithms from
scratch.
> 5. Someone suggested that I use the PCRE (Perl Compatible Regular Expressions)
> library < http://www.pcre.org/ > (though I don't know much about Perl) to
> write Perl-style regular expressions and call them from C++. The
> Wikipedia page above mentions some Perl Compatible Regular Expressions for
> sentence breaking. Other than that, someone also suggested taking
> this Perl module <
> http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm>
> and coding it up in C++ using PCRE.
I wouldn't take the regular expression route. While regular expressions
can probably do a reasonable job of finding sentence boundaries, they
ultimately can't express everything you can with code, so at some point
you'll probably find you have to rewrite without regular expressions
anyway.
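To make that concrete, here is a minimal sketch of a hand-coded splitter
in which each boundary rule is an ordinary condition that can inspect
context and consult a lookup table. Everything in it is an assumption
made for illustration: the names split_sentences and word_before are
hypothetical, the abbreviation list is a tiny sample, and it only handles
ASCII text. It is not the Unicode TR29 algorithm and not part of Xapian
or ICU.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: the run of ASCII letters ending just before `pos`.
static std::string word_before(const std::string& text, size_t pos) {
    assert(pos <= text.size());
    size_t start = pos;
    while (start > 0 &&
           std::isalpha(static_cast<unsigned char>(text[start - 1]))) {
        --start;
    }
    return text.substr(start, pos - start);
}

// Hypothetical function: split ASCII text into sentences using rules
// written as plain code rather than as a single regular expression.
std::vector<std::string> split_sentences(const std::string& text) {
    // Sample abbreviation table (an assumption, far from complete).
    static const std::set<std::string> abbreviations = {
        "Dr", "Mr", "Mrs", "Ms", "Prof", "etc", "vs"};

    std::vector<std::string> sentences;
    size_t start = 0;
    for (size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;

        // Rule 1: a full stop after a known abbreviation is not a boundary.
        if (c == '.' && abbreviations.count(word_before(text, i)) > 0) continue;

        // Rule 2: the terminator must be followed by whitespace or the end
        // of the text, so "3.50" and "example.com" are left alone.
        if (i + 1 < text.size() &&
            !std::isspace(static_cast<unsigned char>(text[i + 1]))) {
            continue;
        }

        sentences.push_back(text.substr(start, i + 1 - start));

        // Skip the whitespace separating this sentence from the next.
        start = i + 1;
        while (start < text.size() &&
               std::isspace(static_cast<unsigned char>(text[start]))) {
            ++start;
        }
        i = start - 1;  // the loop's ++i resumes at `start`
    }
    // Keep any trailing text that lacks a terminator.
    if (start < text.size()) sentences.push_back(text.substr(start));
    return sentences;
}
```

Calling split_sentences("Dr. Smith paid 3.50 dollars. Was it enough? Yes!")
gives three sentences, with "Dr." and "3.50" left intact. Adding another
rule (say, for quotes or initials) is one more condition in the loop,
which is the kind of change that tends to get awkward in a regular
expression.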
Cheers,
Olly