[Xapian-discuss] Different Collation (utf8_slovak_ci, utf8_danish_ci, latin1_german1_ci) etc.

Olly Betts olly at survex.com
Thu Mar 2 20:12:24 GMT 2006


On Thu, Mar 02, 2006 at 10:55:22AM -0800, Kevin SoftDev wrote:
> One issue left for me to figure out is that in different languages there are
> different characters and Xapian takes only english characters.

No, Xapian isn't limited to English characters.

Xapian::Stem and Xapian::QueryParser currently assume ISO-8859-1 (which
covers most Western European languages, plus some others), but should be
fixed to be able to handle UTF-8 fairly soon.

Everything else treats the data as opaque, so is agnostic about encoding
issues.  The core library is zero-byte safe, so wide characters should
be fine too.  I've found (and fixed) code in the bindings (and SWIG)
which isn't zero-byte safe, but I haven't done a thorough audit, so you
may still hit issues there.  If you do, please report them, as they're
easy to fix once identified.
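
To illustrate why zero-byte safety matters for wide characters, here is a
small Python sketch.  It doesn't use Xapian at all; it only shows that a
wide encoding such as UTF-16 puts zero bytes inside ordinary text, and what
a layer which stops at the first zero byte (as C-string handling would)
ends up keeping.  The sample word is made up.

```python
# A wide encoding of plain ASCII text contains zero bytes.
data = "term".encode("utf-16-le")
print(data)

# Simulate a layer that treats the first zero byte as end-of-string.
truncated = data.split(b"\x00", 1)[0]
print(truncated)

# A zero-byte safe layer keeps every byte, so the text round-trips.
print(data.decode("utf-16-le"))
```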

> Therefore many word entered by users that contains their own language special
> characters will not return any result. MySQL offers different collations ...

Assuming by a collation you mean a total order on pairs of strings, I
don't plan to implement that in Xapian, because I think it's better
addressed externally (for reasons of efficiency mainly).

The sort order aspect of a collation would affect TermIterators, but it
would be expensive to make a TermIterator return terms in anything other
than the natural order.  I think if you need terms ordered in a
particular way it's better to gather those you want and then sort them.
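
A rough sketch of the "gather, then sort" approach, in Python with only the
standard library.  The list of terms is made up and stands in for whatever
you collected from a TermIterator, and collation_key() is a hypothetical
helper which ignores accents and case.  It is an assumption for
illustration, not the rules of any particular MySQL collation.

```python
import unicodedata

def collation_key(term):
    """Hypothetical sort key: accent-stripped and case-folded, with the
    original string as a tie-breaker so the order is deterministic."""
    decomposed = unicodedata.normalize("NFKD", term)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), term)

# Made-up terms, standing in for ones gathered from the index.
terms = ["zebra", "Ärger", "apple", "Zug", "árbol"]

# Sort outside the database, using whatever ordering the application needs.
ordered = sorted(terms, key=collation_key)
print(ordered)
```

Swapping in a different key function changes the ordering without touching
the database.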

The "different character strings comparing equal" aspect really needs to
be handled by converting them to a canonical form when generating terms
(and applying the same conversion to the terms in a query).  Otherwise
you're going to need to do an "OR" query over all the variant forms of
any single term which is affected by this.
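
As a minimal sketch of such a canonical form, here is one possible folding
in Python using only the standard library.  The function name
canonical_term() and the choice of folding (Unicode decomposition, dropping
combining marks, case folding) are assumptions for illustration.  A real
application would pick the folding which matches the equivalences it wants,
and letters which have no Unicode decomposition are left as they are.

```python
import unicodedata

def canonical_term(word):
    """Fold a word to one canonical spelling: decompose accented letters,
    drop the combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# The accented and unaccented spellings map to the same term.
print(canonical_term("Žluťoučký"))
print(canonical_term("zlutoucky"))
```

Apply the same function to each word at index time and to each word of the
user's query, so that both sides agree on the spelling of a term.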

You could potentially allow a single collation to be specified for
ordering (and treating as equal or not) terms at the Btree table
manager level, but you couldn't change it except by rebuilding the
database from scratch, and it would complicate the Btree manager a lot.

Cheers,
    Olly


