[Xapian-devel] Moving indextext.cc into core.
Richard Boulton
richard at lemurconsulting.com
Wed Mar 28 17:53:05 BST 2007
One of the items on the ToDo list for version 1.0 at
http://wiki.xapian.org/TodoFor1_2e0#preview is:
"Rework Omega's indextext.cc as a xapian-core "TextSplitter" class."
I've been wondering about this for a while now. Currently, we have the
Query Parser in Xapian core, but no text processing. Clearly, it makes
sense to have a "text splitter" class in whichever library the query
parser is in, since the query parser is hard to use correctly without
compatible text processing, so doing this would be a step in the right
direction. The question in my mind is whether either of them belongs in
the core Xapian library, which is otherwise agnostic about the contents
of the document supplied to it.
[Actually, I'm not sure that "text splitter" is the right name for what
the code in indextext.cc does - it doesn't just split text, but also
does stemming, creates "R" terms, and possibly a few other things I've
missed. I'd call it a "TextProcessor" class, but someone else might
have a better name.]
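
As a rough illustration of the kind of work such a class would bundle
together, here is a minimal sketch in standard C++. Everything in it is
hypothetical: the TextProcessor name is only the suggestion above, toy_stem
is a stand-in for a real stemming algorithm, and the rule that capitalised
words get an extra unstemmed "R"-prefixed term is an assumption about what
"R" terms mean, not a description of the actual indextext.cc code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch only: none of these names are Xapian API.
class TextProcessor {
public:
    // Split text on non-alphanumeric characters and emit index terms.
    std::vector<std::string> process(const std::string& text) const {
        std::vector<std::string> terms;
        std::string word;
        // Run one step past the end so the final word is flushed.
        for (std::size_t i = 0; i <= text.size(); ++i) {
            unsigned char ch = i < text.size() ? text[i] : ' ';
            if (std::isalnum(ch)) {
                word += static_cast<char>(ch);
            } else if (!word.empty()) {
                add_word(word, terms);
                word.clear();
            }
        }
        return terms;
    }

private:
    static std::string lowercase(std::string s) {
        for (char& c : s)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return s;
    }

    // Toy stand-in for a real stemmer: strips one trailing "s".
    static std::string toy_stem(const std::string& s) {
        if (s.size() > 3 && s.back() == 's') return s.substr(0, s.size() - 1);
        return s;
    }

    static void add_word(const std::string& word,
                         std::vector<std::string>& terms) {
        const std::string lower = lowercase(word);
        // Assumption: capitalised words also get an unstemmed "R" term.
        if (std::isupper(static_cast<unsigned char>(word[0])))
            terms.push_back("R" + lower);
        terms.push_back(toy_stem(lower));
    }
};
```

With these toy rules, processing "Xapian indexes documents" yields the terms
Rxapian, xapian, indexe and document, which shows that the class does rather
more than split the text.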
A cleaner separation and code organisation, to my mind, would be to make
a new intermediate library which sits on top of Xapian, and provides
language-specific processing features. The stemming algorithm stuff
would also be moved into this library. So, we would end up with:
Xapian-Core: lowest level code - doesn't care about what the documents
and terms it handles are.
Xapian-Text: text handling code - contains routines to generate terms
and documents from pieces of text, both for searching and for indexing.
We would then move omega to use Xapian-Text instead of having its own
text processing code, and then all applications built on Xapian could
use this code if they want it, and just link directly to Xapian-Core if
they only need the core library.
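
The layering described above might be sketched roughly as follows. This is
not Xapian code: the core and text namespaces, the Index class and the
make_terms, index_text and search_text functions are all invented for
illustration, and the only normalisation shown is lowercasing. The point is
that the core layer only ever sees opaque term strings, while the text layer
applies the same processing for both indexing and searching.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not Xapian's API.
namespace core {
// Core layer: stores opaque term strings and never inspects text.
class Index {
public:
    void add_document(int docid, const std::vector<std::string>& terms) {
        for (const std::string& t : terms) postings_[t].insert(docid);
    }
    // Exact-match lookup of a single term.
    std::set<int> search(const std::string& term) const {
        std::set<int> result;
        auto it = postings_.find(term);
        if (it != postings_.end()) result = it->second;
        return result;
    }
private:
    std::map<std::string, std::set<int>> postings_;
};
}  // namespace core

namespace text {
// Text layer: one term-generation routine shared by indexing and searching.
inline std::vector<std::string> make_terms(const std::string& s) {
    std::vector<std::string> terms;
    std::string word;
    for (char c : s) {
        unsigned char ch = static_cast<unsigned char>(c);
        if (std::isalnum(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else if (!word.empty()) {
            terms.push_back(word);
            word.clear();
        }
    }
    if (!word.empty()) terms.push_back(word);
    return terms;
}

inline void index_text(core::Index& idx, int docid, const std::string& body) {
    idx.add_document(docid, make_terms(body));
}

// Returns documents matching any query term (a simple OR).
inline std::set<int> search_text(const core::Index& idx,
                                 const std::string& query) {
    std::set<int> result;
    for (const std::string& t : make_terms(query)) {
        const std::set<int> hits = idx.search(t);
        result.insert(hits.begin(), hits.end());
    }
    return result;
}
}  // namespace text
```

An application that only needs the core would call core::Index directly with
its own terms; one that wants text handling would go through the text
functions, so that queries and documents are processed compatibly.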
Having a new library for just the query parser and the indextext.cc code
might seem a bit overkill - but I think there's rather a lot of extra
stuff which would belong in this middle layer library. For example:
- the stemming algorithms.
- stopwording algorithms.
- date parsing and term generation.
- standard match deciders for doing things like value range
restrictions, or sort comparison functions.
- automatic language detection code.
- fuzzy matching code (e.g., metaphone implementations, trigram matching
implementations).
- spelling correction algorithms.
I don't think we'd necessarily need a new top-level module for this code;
doing so would make the separation more obvious, but would require a bit
more work than just fiddling with the build system in the xapian-core
module to produce two libraries. The most important thing would be to
make sure that no header files contain declarations for both of the
libraries, but I don't think any currently would.
Anyway - I probably have a little time over the next couple of days to
dedicate to this, so comments would be welcomed. If nothing else, I can
implement a patch for the "Rework Omega's indextext.cc as a xapian-core
"TextSplitter" class." task (so if anyone else is already working on
this, shout now!).
--
Richard