[Xapian-devel] Bitsize project: Krovetz Stemmer

Sun Feb 15 22:34:07 GMT 2015

On Sun, Feb 15, 2015 at 05:05:11PM +0000, James Aylett wrote:
> How you then structure that in your code as you load it from file and
> later use it is entirely up to you. If it’s just a list of words that
> should be treated specially, having a class to represent each word
> feels like overkill — you can probably do it with something like an
> STL container of a basic_string of some sort (std::wstring? I haven’t
> done much Unicode in C++ work, so others may want to jump in and
> correct me here).

Where xapian-core cares about the encoding, it deals with UTF-8 encoded
text, which we store as const char * or std::string.  Using std::wstring
would be appropriate if we were handling wide characters, but converting
UTF-8 to and from a wide character string is likely to end up
significantly slower.  The trade-off is that iterating a UTF-8 string is
more complex than iterating a wide character string - there it's a simple
pointer dereference and increment per Unicode character.
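
To make that difference concrete, here's a rough sketch (not code from
xapian-core; next_code_point, utf8_length and wide_length are names made
up for illustration).  It assumes the input is valid UTF-8 and does no
error handling, and it uses char32_t as a stand-in for a wide character
because the width of wchar_t varies between platforms.  In real code
you'd probably want Xapian::Utf8Iterator from <xapian/unicode.h> rather
than rolling your own decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point starting at p and advance p past it.
// Assumes [p, end) holds valid UTF-8; malformed input is not detected.
inline unsigned next_code_point(const char*& p, const char* end) {
    assert(p < end);
    unsigned char b = static_cast<unsigned char>(*p++);
    if (b < 0x80) return b;  // Single byte: 0xxxxxxx

    int extra;    // Number of continuation bytes still to read.
    unsigned cp;  // Payload bits gathered so far.
    if (b >= 0xF0) {
        extra = 3; cp = b & 0x07;  // 11110xxx
    } else if (b >= 0xE0) {
        extra = 2; cp = b & 0x0F;  // 1110xxxx
    } else {
        extra = 1; cp = b & 0x1F;  // 110xxxxx
    }
    while (extra-- > 0 && p < end) {
        // Each continuation byte (10xxxxxx) contributes 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
    }
    return cp;
}

// Count the Unicode characters in a UTF-8 string: every step has to
// decode a variable-length sequence.
inline std::size_t utf8_length(const std::string& s) {
    const char* p = s.data();
    const char* end = p + s.size();
    std::size_t n = 0;
    while (p < end) {
        next_code_point(p, end);
        ++n;
    }
    return n;
}

// The same walk over a wide (here: UTF-32) string is just a pointer
// dereference and increment per character.
inline std::size_t wide_length(const std::u32string& s) {
    std::size_t n = 0;
    for (const char32_t* p = s.data(); *p != U'\0'; ++p) ++n;
    return n;
}
```

The UTF-8 loop does more work per character, but it avoids converting
the text in the first place, which is the cost being weighed above.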

If you use std::string, that is a class and it represents each word,
which as James says might indeed be overkill.  If the stemming
dictionary is potentially very large, you might want to load the file
into a single allocated block of memory and then just use const char *
into that block for the words - that would avoid the overhead of
creating a huge number of std::string objects.
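
Something along these lines is what I have in mind - a minimal sketch
only, assuming the dictionary is a plain file with one UTF-8 word per
line and '\n' line endings (I don't know what format you'll end up
with).  WordBlock and its members are hypothetical names, not anything
that exists in xapian-core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Holds a whole word list in one allocated block and refers to each
// word with a const char * into that block.
class WordBlock {
    std::vector<char> block_;         // File contents, '\n' replaced by '\0'.
    std::vector<const char*> words_;  // Start of each word inside block_.

    static bool less(const char* a, const char* b) {
        return std::strcmp(a, b) < 0;
    }

  public:
    explicit WordBlock(std::istream& in)
        : block_(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>())
    {
        // Sentinel newline so a final line without '\n' is terminated too.
        block_.push_back('\n');
        char* start = block_.data();
        char* end = start + block_.size();
        for (char* p = start; p != end; ++p) {
            if (*p != '\n') continue;
            *p = '\0';                                // Terminate in place.
            if (p != start) words_.push_back(start);  // Skip empty lines.
            start = p + 1;
        }
        // Sort the pointers so lookups can use a binary search.
        std::sort(words_.begin(), words_.end(), less);
    }

    // The pointers refer to this object's block_, so forbid copying.
    WordBlock(const WordBlock&) = delete;
    WordBlock& operator=(const WordBlock&) = delete;

    std::size_t size() const { return words_.size(); }

    const char* operator[](std::size_t i) const {
        assert(i < words_.size());
        return words_[i];
    }

    bool contains(const char* word) const {
        return std::binary_search(words_.begin(), words_.end(), word, less);
    }
};
```

You'd construct it from a std::ifstream (or a std::istringstream when
testing) and call contains("word").  That's two allocations in total
however many words there are, rather than one std::string per word.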

Cheers,
    Olly