<div>Hi,</div><div><br></div><div>I have been working on developing a Link Grammar interface, so that POS tagging can be used while indexing documents.</div><div><br></div><div>The interface header and implementation file are complete, and you can view them at < <a href="https://github.com/sehaj-sk/xapian/commit/052d634e1986bcf5607e43f52ac3e07646920196">https://github.com/sehaj-sk/xapian/commit/052d634e1986bcf5607e43f52ac3e07646920196</a> > and < <a href="https://github.com/sehaj-sk/xapian/commit/3015223662986d7a180d77101d6f4664f6552144">https://github.com/sehaj-sk/xapian/commit/3015223662986d7a180d77101d6f4664f6552144</a> > respectively.</div>
<div><br></div><div>After that, I tried using the interface for indexing documents. Here's the code that uses the Link Grammar interface for POS-tagged indexing in termgenerator: < <a href="https://github.com/sehaj-sk/xapian/commit/4ed8e505b44581fcc038598ec0b7cd011e42f8da">https://github.com/sehaj-sk/xapian/commit/4ed8e505b44581fcc038598ec0b7cd011e42f8da</a> ></div>
<div><br></div><div>I have added a simple example in the xapian-core/examples directory that shows the results of this feature. The example is at < <a href="https://github.com/sehaj-sk/xapian/commit/75c2e4749e9084fca5f390b88d565cb117e90d38">https://github.com/sehaj-sk/xapian/commit/75c2e4749e9084fca5f390b88d565cb117e90d38</a> ></div>
<div><br></div><div>At present it can index only a single sentence at a time.</div><div>To index a larger text, I first need to break it into sentences.</div><div>I would therefore like suggestions on how to do Sentence Boundary Disambiguation.</div>
<div><br></div><div>Please suggest any paper/algorithm that could be coded, or any existing library that could be used.</div><div>The focus at present is on the English language only.</div><div><br></div><div><br></div><div>I have done some searching, and here's what I have found:</div>
<div><br></div><div>1. Here's a Wikipedia article that discusses the problem and the available solutions: < <a href="http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation">http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation</a> ></div>
<div>2. There are not many solutions available in C/C++. Almost all of them are in either Python or Java.</div><div>3. There's a sentence boundary detection algorithm defined by the Unicode Standard. It's described at < <a href="http://www.unicode.org/reports/tr29/#Sentence%5FBoundaries">http://www.unicode.org/reports/tr29/#Sentence%5FBoundaries</a> ></div>
<div>4. An existing C++ API that does this is ICU's BreakIterator class, documented here: < <a href="http://icu-project.org/apiref/icu4c/classBreakIterator.html">http://icu-project.org/apiref/icu4c/classBreakIterator.html</a> > .</div>
<div>Here's a line from its documentation: "The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Text Boundaries, and Unicode Standard Annex #14, Line Breaking Properties. These are available at < <a href="http://www.unicode.org/reports/tr14/">http://www.unicode.org/reports/tr14/</a> > and < <a href="http://www.unicode.org/reports/tr29/">http://www.unicode.org/reports/tr29/</a> > ."</div>
<div>5. Someone suggested that I use the PCRE (Perl Compatible Regular Expressions) library < <a href="http://www.pcre.org/">http://www.pcre.org/</a> > (though I don't know much about Perl), so that Perl-style regular expressions can be used from C++. The Wikipedia page above mentions some Perl-compatible regular expressions for sentence breaking. Other than that, someone also suggested taking this Perl module < <a href="http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm">http://search.cpan.org/~achimru/Lingua-Sentence-1.00/lib/Lingua/Sentence.pm</a> > and coding it up in C++ using PCRE.</div>
<div>6. There are also a lot of research papers on this problem.</div><div><br></div><div>Looking forward to your guidance and suggestions.</div><div><br></div><div>Thanks,</div><div>Sehaj</div>
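<div><br></div><div>P.S. To make the rule-based idea from items 1 and 5 more concrete, here is a rough sketch of the kind of heuristic I have in mind: a '.', '!' or '?' ends a sentence if it is followed by whitespace and then a capital letter (or the end of the text), unless the word before a '.' is a known abbreviation. It uses only the C++ standard library rather than PCRE. The function name split_sentences and the tiny abbreviation list are made up for illustration and are not part of Xapian or Link Grammar, and it assumes ASCII English text.</div>

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Xapian or Link Grammar): splits ASCII
// English text into sentences using a simple punctuation heuristic.
std::vector<std::string> split_sentences(const std::string& text) {
    // Tiny illustrative abbreviation list; a usable one would be far longer.
    static const std::set<std::string> abbreviations = {
        "Mr", "Mrs", "Ms", "Dr", "Prof", "etc", "vs"};

    std::vector<std::string> sentences;
    const std::size_t n = text.size();
    std::size_t start = 0;  // index where the current sentence begins

    for (std::size_t i = 0; i < n; ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;

        // Find the first non-whitespace character after the punctuation.
        std::size_t next = i + 1;
        while (next < n && std::isspace(static_cast<unsigned char>(text[next])))
            ++next;

        if (next < n) {
            // No whitespace after the mark (e.g. "3.30"): not a boundary.
            if (next == i + 1) continue;
            // The following word must start with a capital letter.
            if (!std::isupper(static_cast<unsigned char>(text[next]))) continue;
        }

        if (c == '.') {
            // Look at the token right before the period.
            std::size_t tok = i;
            while (tok > start &&
                   !std::isspace(static_cast<unsigned char>(text[tok - 1])))
                --tok;
            if (abbreviations.count(text.substr(tok, i - tok))) continue;
        }

        sentences.push_back(text.substr(start, i + 1 - start));
        start = next;
    }

    // Keep any trailing text that has no closing punctuation.
    if (start < n) sentences.push_back(text.substr(start));
    return sentences;
}
```

<div>Each returned string could then be passed to the single-sentence indexing code one at a time. This will obviously misfire on cases like an abbreviation that really does end a sentence, which is why I am asking about better algorithms or libraries.</div>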