Search requests should ignore accents (C++ API)?

Kim Walisch kim.walisch at gmail.com
Thu Jul 26 07:41:03 BST 2018


Thanks for your detailed answer!

Kim

On Thu, Jul 26, 2018 at 12:42 AM Olly Betts <olly at survex.com> wrote:

> On Wed, Jul 25, 2018 at 04:33:58PM +0200, Kim Walisch wrote:
> > I am using libxapian in a C++ project (hence I am using Xapian's C++ API)
> > and some user has requested that search requests should ignore accents.
> > E.g. when the user searches for "Herr Müller" he expects that "Herr
> > Muller" is also a search hit.
>
> Simply stripping accents can be a reasonable normalisation in some
> languages, but it's not always appropriate.  The most obvious problem
> is for languages where there are words with different meanings which
> differ in spelling only by their accents.
>
> There are also languages where you don't just drop the accent if
> you aren't able to write it.  German is actually an example of that -
> you'd write "Mueller" if you weren't able to write the accent, not
> "Muller".
>
> > Is this possible in Xapian?
>
> Some of the stemmers normalise accents in some or all cases.  That only
> helps when the stemmed form is being matched though, and is less useful
> for real names.
>
> > Do you have any links to the documentation of that feature?
>
> People often use https://www.nongnu.org/unac/ for this, though it
> doesn't look like it's very actively maintained (or else there's a
> new home page for it I couldn't trivially find).  That seems to
> just drop the umlaut though.
>
> There's also g_str_to_ascii () in glib:
>
> https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-str-to-ascii
>
> That seems to be a bit too heavy a hammer for normalising text for
> search though as you really don't want pure ASCII out in every case
> (particularly for languages which don't use the Latin alphabet).
>
> ICU is another option - there's an example of removing accents by
> decomposing, removing non-spacing marks, then recomposing here:
>
> http://userguide.icu-project.org/transforms/general
>
> But again, that would just drop the umlaut.
>
> Cheers,
>     Olly
>
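
The difference between just dropping the umlaut ("Muller") and the German
transliteration ("Mueller") described above can be illustrated with a small
sketch.  This is not Xapian, unac, glib or ICU API: fold_accents and FoldMode
are hypothetical names, the input is assumed to be UTF-8 with precomposed
characters, and the table covers only a handful of letters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of any library mentioned in this thread.
// Strip  : drop the accent        (ü -> u)
// German : German transliteration (ü -> ue)
enum class FoldMode { Strip, German };

std::string fold_accents(const std::string& utf8, FoldMode mode)
{
    struct Entry {
        const char* from;    // two-byte UTF-8 sequence of the accented letter
        const char* strip;   // replacement when simply dropping the accent
        const char* german;  // replacement when transliterating German-style
    };
    // Deliberately tiny table; a real one would need many more entries.
    static const Entry table[] = {
        {"\xC3\xA4", "a", "ae"},  // ä
        {"\xC3\xB6", "o", "oe"},  // ö
        {"\xC3\xBC", "u", "ue"},  // ü
        {"\xC3\x84", "A", "Ae"},  // Ä
        {"\xC3\x96", "O", "Oe"},  // Ö
        {"\xC3\x9C", "U", "Ue"},  // Ü
        {"\xC3\xA9", "e", "e"},   // é (no special German form)
    };

    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();) {
        bool matched = false;
        if (i + 1 < utf8.size()) {
            for (const Entry& e : table) {
                // Compare the next two bytes with the table entry.
                if (utf8.compare(i, 2, e.from) == 0) {
                    out += (mode == FoldMode::German) ? e.german : e.strip;
                    i += 2;
                    matched = true;
                    break;
                }
            }
        }
        // Anything not in the table is copied through unchanged.
        if (!matched) out += utf8[i++];
    }
    return out;
}
```

Whichever folding is chosen would have to be applied to both the text being
indexed and the query string, otherwise the two sides won't produce matching
terms.  Text in a decomposed form (base letter followed by a combining mark)
is not handled by this sketch.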


More information about the Xapian-discuss mailing list