[Xapian-discuss] UTF8 support plans (without stemming)

Olly Betts olly at survex.com
Wed Jun 29 12:26:06 BST 2005


On Wed, Jun 29, 2005 at 10:03:04AM +0100, Richard Boulton wrote:
> Just mailing to make sure you know that snowball now supports UTF-8
> quite happily.

Yes, I saw Martin's mail about it.

> It shouldn't be much work to use them - just replace the
> contents of the languages/ directory with the libstemmer_c library from
> http://snowball.tartarus.org/dist/libstemmer_c.tgz
> and then reimplement omstem.cc using them.

The problem is that some of the stemming algorithms have changed since
the versions we currently use, so simply upgrading is awkward: searches
of existing databases won't work for terms which are now stemmed
differently.

So this isn't really something to change in a point release.
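
To make the mismatch concrete, here is a toy sketch. It is not Xapian's or
snowball's code: old_stem, new_stem, build_index and matches are hypothetical
stand-ins, and the stemming rules are invented for illustration. The index
holds terms stemmed by the old algorithm, so a query term stemmed by the new
algorithm only matches when both algorithms happen to agree on that word.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <string>

// Hypothetical "old" stemmer: strips a single trailing 's' and nothing else.
std::string old_stem(const std::string& word) {
    if (word.size() > 1 && word.back() == 's')
        return word.substr(0, word.size() - 1);
    return word;
}

// Hypothetical "new" stemmer: additionally strips a trailing "ing".
std::string new_stem(const std::string& word) {
    if (word.size() > 4 && word.compare(word.size() - 3, 3, "ing") == 0)
        return word.substr(0, word.size() - 3);
    return old_stem(word);
}

// Stand-in for an existing database: terms stemmed at index time
// with the old algorithm.
std::set<std::string> build_index(std::initializer_list<const char*> words) {
    std::set<std::string> index;
    for (const char* w : words)
        index.insert(old_stem(w));
    return index;
}

// True if the (already stemmed) query term is present in the index.
bool matches(const std::set<std::string>& index, const std::string& term) {
    return index.count(term) > 0;
}

// The asserts document the expected behaviour of the toy rules above.
void demo() {
    const std::set<std::string> index = build_index({"stemming", "searches"});
    // Same stemmer at index and query time: the term is found.
    assert(matches(index, old_stem("stemming")));
    // New stemmer at query time yields "stemm"; the index holds "stemming".
    assert(!matches(index, new_stem("stemming")));
    // Words on which both stemmers agree are unaffected.
    assert(matches(index, new_stem("searches")));
}
```

Only the terms whose stems differ between the two versions go missing, which
is why the breakage is easy to overlook until someone searches for one.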

> Out of interest, why are all the snowball_* files in the languages
> directory compiled as C++?

This NEWS file entry sums it up well:

* Change the small number of C sources we have to be C++ so we can compile
  everything with the C++ compiler.  This way we don't need to worry about
  configure choosing a mismatching pair of compilers, or about whether
  configure tests with the C compiler don't apply to the C++ compiler, or vice
  versa.

We really don't need to use the few parts of C which aren't in C++'s C-like
subset so it shouldn't be an onerous restriction.  In fact most of the
differences are poor style in modern C anyway (e.g. not using prototypes).
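
As a small illustration of that common subset, the sketch below is meant to
compile unchanged as either C or C++. copy_string is a hypothetical helper,
not something from the Xapian or snowball sources. Besides the prototype
point above, it shows one other well-known difference (my example, not one
named in the NEWS entry): C converts the void* returned by malloc implicitly,
while C++ requires a cast, and the cast is also accepted by C.

```cpp
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Prototype visible before any use: required by C++, and good style in C. */
char *copy_string(const char *s);

/* Returns a malloc'd copy of s, or NULL if allocation fails.
 * The caller releases the result with free(). */
char *copy_string(const char *s)
{
    size_t len;
    char *p;

    assert(s != NULL);
    len = strlen(s) + 1;
    /* Explicit cast: needed for C++, harmless in C. */
    p = (char *)malloc(len);
    if (p != NULL)
        memcpy(p, s, len);
    return p;
}
```

A caller would use it the same way in either language, e.g.
char *t = copy_string("stem"); followed by free(t); once done.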

Cheers,
    Olly


