[Xapian-devel] Moving indextext.cc into core.

Wed Mar 28 17:53:05 BST 2007

One of the items on the ToDo list for version 1.0 at 
http://wiki.xapian.org/TodoFor1_2e0#preview is:

"Rework Omega's indextext.cc as a xapian-core "TextSplitter" class."

I've been wondering about this for a while now.  Currently, we have the 
Query Parser in Xapian core, but no text processing.  Clearly, it makes 
sense to have a "text splitter" class in whichever library the query 
parser is in, since the query parser is hard to use correctly without 
compatible text processing, so doing this would be a step in the right 
direction.  The question in my mind is whether either of them belongs in 
the core Xapian library, which is otherwise agnostic about the contents 
of the document supplied to it.

[Actually, I'm not sure that "text splitter" is the right name for what 
the code in indextext.cc does - it doesn't just split text, but also 
does stemming, creates "R" terms, and possibly a few other things I've 
missed.  I'd call it a "TextProcessor" class, but someone else might 
have a better name.]
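
As a rough illustration of why "splitter" undersells the job, here is a minimal sketch of a processor that splits, stems, and emits "R" terms in one pass. Everything in it is hypothetical: the names TextProcessor, process and strip_plural are not taken from indextext.cc, the stemmer is a toy, and it assumes that an "R" term is the raw (unstemmed) lowercase form of a capitalised word.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch only: none of these names come from Omega's indextext.cc.
struct TextProcessor {
    // The stemmer is injected so the sketch does not depend on any real
    // stemming library; the default leaves words unchanged.
    std::function<std::string(const std::string&)> stem =
        [](const std::string& w) { return w; };

    // Split `text` on non-alphanumeric characters and return the terms
    // that would be generated for it, in order.
    std::vector<std::string> process(const std::string& text) const {
        std::vector<std::string> terms;
        std::string word;
        auto flush = [&]() {
            if (word.empty()) return;
            bool capitalised =
                std::isupper(static_cast<unsigned char>(word[0])) != 0;
            std::string lower;
            for (unsigned char ch : word)
                lower += static_cast<char>(std::tolower(ch));
            // Assumption: an "R" term is the raw, unstemmed form of a
            // capitalised word.
            if (capitalised) terms.push_back("R" + lower);
            terms.push_back(stem(lower));
            word.clear();
        };
        for (unsigned char ch : text) {
            if (std::isalnum(ch)) word += static_cast<char>(ch);
            else flush();
        }
        flush();
        return terms;
    }
};

// Toy stemmer for demonstration: strips one trailing "s" from longer words.
// It is not a real stemming algorithm.
inline std::string strip_plural(const std::string& w) {
    return (w.size() > 3 && w.back() == 's') ? w.substr(0, w.size() - 1) : w;
}
```

With strip_plural plugged in as the stemmer, processing "Xapian indexes documents" would yield the terms Rxapian, xapian, indexe and document.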

A cleaner separation and code organisation, to my mind, would be to make 
a new intermediate library which sits on top of Xapian, and provides 
language specific processing features.  The stemming algorithm stuff 
would also be moved into this library.  So, we would end up with:

Xapian-Core: lowest level code - doesn't care about what the documents 
and terms it handles are.

Xapian-Text: text handling code - contains routines to generate terms 
and documents from pieces of text, both for searching and for indexing.

We would then move omega to use Xapian-Text instead of having its own 
text processing code, and then all applications built on Xapian could 
use this code if they want it, and just link directly to Xapian-Core if 
they only need the core library.
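
The proposed layering could be sketched roughly as follows. This is not the Xapian API: the namespaces core and text and the functions index_text, parse_query and matches are invented for illustration. The point it tries to show is that the lower layer only ever sees opaque term strings, while the upper layer shares one normalisation routine between indexing and query parsing so that the two stay compatible.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the Xapian API.
namespace core {
// Lowest layer: stores opaque term strings and knows nothing about text.
class Document {
    std::set<std::string> terms_;
public:
    void add_term(const std::string& t) { terms_.insert(t); }
    bool has_term(const std::string& t) const { return terms_.count(t) != 0; }
};

// A document matches if it contains every term in the query.
inline bool matches(const Document& doc, const std::vector<std::string>& q) {
    for (const auto& t : q)
        if (!doc.has_term(t)) return false;
    return true;
}
}  // namespace core

namespace text {
// Shared normalisation: split on whitespace and lowercase each word.
// Both indexing and query parsing go through this one function.
inline std::vector<std::string> make_terms(const std::string& s) {
    std::vector<std::string> out;
    std::istringstream in(s);
    std::string w;
    while (in >> w) {
        for (char& c : w)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        out.push_back(w);
    }
    return out;
}

// Indexing side: turn a piece of text into terms on a core document.
inline void index_text(core::Document& doc, const std::string& s) {
    for (const auto& t : make_terms(s)) doc.add_term(t);
}

// Searching side: turn a query string into terms the core can match.
inline std::vector<std::string> parse_query(const std::string& q) {
    return make_terms(q);
}
}  // namespace text
```

An application that generates its own terms would use only the core namespace; one that wants text handling would call text::index_text and text::parse_query, and get matching behaviour on both sides without writing any normalisation code itself.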

Having a new library for just the query parser and the indextext.cc code 
might seem a bit overkill - but I think there's rather a lot of extra 
stuff which would belong in this middle layer library.  For example:

  - the stemming algorithms.
  - stopwording algorithms.
  - date parsing and term generation.
  - standard match deciders for doing things like value range
    restrictions, or sort comparison functions.
  - automatic language detection code.
  - fuzzy matching code (eg, metaphone implementations, trigram matching
    implementations).
  - spelling correction algorithms.
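
To give a feel for the size of one of these items, a value range match decider might look something like the sketch below. Doc and ValueRangeDecider are hypothetical stand-ins written for this example rather than existing classes, and the sketch assumes values are strings held in numbered slots and compared as plain strings.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a document carrying numbered string values.
struct Doc {
    std::map<unsigned, std::string> values;
};

// Accepts a document only if the value in `slot` lies in [lo, hi],
// using plain string comparison. Documents without the value are rejected.
class ValueRangeDecider {
    unsigned slot_;
    std::string lo_, hi_;
public:
    ValueRangeDecider(unsigned slot, std::string lo, std::string hi)
        : slot_(slot), lo_(std::move(lo)), hi_(std::move(hi)) {}

    bool operator()(const Doc& d) const {
        auto it = d.values.find(slot_);
        if (it == d.values.end()) return false;
        return lo_ <= it->second && it->second <= hi_;
    }
};
```

String comparison only gives sensible ranges when the values are encoded so that they sort correctly, for example dates stored as YYYYMMDD; that encoding is exactly the sort of thing the date handling code in the same library could take care of.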

I don't think we'd necessarily need a new top-level module for this code; 
doing so would make the separation more obvious, but would require a bit 
more work than just fiddling with the build system in the xapian-core 
module to produce two libraries.  The most important thing would be to 
make sure that no header files contain declarations for both of the 
libraries, but I don't think any currently would.

Anyway - I probably have a little time over the next couple of days to 
dedicate to this, so comments would be welcomed.  If nothing else, I can 
implement a patch for the "Rework Omega's indextext.cc as a xapian-core 
"TextSplitter" class." task (so if anyone else is already working on 
this, shout now!).

-- 
Richard