Search requests should ignore accents (C++ API)?

Olly Betts olly at survex.com
Wed Jul 25 23:42:25 BST 2018


On Wed, Jul 25, 2018 at 04:33:58PM +0200, Kim Walisch wrote:
> I am using libxapian in a C++ project (hence I am using Xapian's C++ API)
> and some user has requested that search requests should ignore accents.
> E.g. when the user searches for "Herr Müller" he expects that "Herr Muller"
> is also a search hit.

Simply stripping accents can be a reasonable normalisation in some
languages, but it's not always appropriate.  The most obvious problem
is for languages where there are words with different meanings which
differ in spelling only by their accents.

There are also languages where you don't just drop the accent if
you aren't able to write it.  German is actually an example of that -
you'd write "Mueller" if you weren't able to write the accent, not
"Muller".
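
To make the difference concrete, here's a rough sketch of the two
normalisations side by side.  It isn't anything Xapian provides: the
names apply_table(), strip_accents() and transliterate_german() are
made up for illustration, the tables only cover the German umlauts and
sharp s, and the input is assumed to be UTF-8 with precomposed
characters.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper types, not part of any library.
using Mapping = std::pair<std::string, std::string>;
using Table = std::vector<Mapping>;

// Replace every UTF-8 byte sequence listed in `table` with its
// replacement; all other bytes are copied through unchanged.
std::string apply_table(const std::string& in, const Table& table) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        for (const auto& entry : table) {
            if (in.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += in[i++];
    }
    return out;
}

// Drop the accent: "Müller" becomes "Muller".
std::string strip_accents(const std::string& in) {
    static const Table table = {
        {"\xC3\xA4", "a"}, {"\xC3\xB6", "o"}, {"\xC3\xBC", "u"},  // ä ö ü
        {"\xC3\x84", "A"}, {"\xC3\x96", "O"}, {"\xC3\x9C", "U"},  // Ä Ö Ü
        {"\xC3\x9F", "ss"},                                       // ß
    };
    return apply_table(in, table);
}

// German convention when the umlaut can't be written: "Müller"
// becomes "Mueller".
std::string transliterate_german(const std::string& in) {
    static const Table table = {
        {"\xC3\xA4", "ae"}, {"\xC3\xB6", "oe"}, {"\xC3\xBC", "ue"},  // ä ö ü
        {"\xC3\x84", "Ae"}, {"\xC3\x96", "Oe"}, {"\xC3\x9C", "Ue"},  // Ä Ö Ü
        {"\xC3\x9F", "ss"},                                          // ß
    };
    return apply_table(in, table);
}
```

Whichever mapping you pick would need to be applied both to the text
before indexing and to the query string, otherwise the two won't meet.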

> Is this possible in Xapian?

Some of the stemmers normalise accents in some or all cases.  That only
helps when the stemmed form is being matched though, and is less useful
for real names.

> Do you have any links to the documentation of that feature?

People often use https://www.nongnu.org/unac/ for this, though it
doesn't look like it's very actively maintained (or else it has a
new home page which I couldn't trivially find).  It seems to just
drop the umlaut though, giving "Muller" rather than "Mueller".

There's also g_str_to_ascii() in glib:

https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-str-to-ascii

That seems to be a bit too heavy a hammer for normalising text for
search though, as you really don't want pure ASCII out in every case
(particularly for languages which don't use the Latin alphabet).

ICU is another option - there's an example of removing accents by
decomposing, removing non-spacing marks, then recomposing here:

http://userguide.icu-project.org/transforms/general

But again, that would just drop the umlaut.
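
The middle step of that pipeline is simple enough to sketch without
ICU.  The function below is only an illustration under some loud
assumptions: strip_combining_marks() is a made-up name, the input must
already be valid UTF-8 in decomposed form (so "ü" is stored as "u"
followed by U+0308), and it only knows about the Combining Diacritical
Marks block U+0300..U+036F rather than every non-spacing mark.  The
decomposing and recomposing would still have to come from a library
such as ICU.

```cpp
#include <cstddef>
#include <string>

// Remove code points U+0300..U+036F from a UTF-8 string.  In UTF-8
// these are the two-byte sequences 0xCC 0x80..0xBF (U+0300..U+033F)
// and 0xCD 0x80..0xAF (U+0340..U+036F).
std::string strip_combining_marks(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        if (i + 1 < in.size()) {
            unsigned char next = static_cast<unsigned char>(in[i + 1]);
            bool is_mark =
                (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
                (lead == 0xCD && next >= 0x80 && next <= 0xAF);
            if (is_mark) {
                ++i;  // skip the second byte of the mark as well
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

Precomposed input passes through untouched, which is why the
decomposition has to happen first.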

Cheers,
    Olly
