[Xapian-discuss] queryparser thinks

Wed Sep 14 08:24:37 BST 2005

On Tue, Sep 13, 2005 at 02:35:45PM +0100, Olly Betts wrote:
> On Tue, Sep 13, 2005 at 11:31:19AM +0200, Ralf Mattes wrote:
> > On Tue, Sep 13, 2005 at 05:08:08AM +0100, Olly Betts wrote:
> > > It's more germanocentric if anything.  
> > 
> > Well, but in German 'accents' (umlauts et al.) _do_ carry meaning.
> 
> Yes, but there's a standard way to write a word if you can't (or don't
> know how to) write the accents.  There are also regional variations.
> For example, I'm told that ß is rarely used in Swiss German
> speaking areas - instead they write "ss".  

Indeed, but that's hardly comparable. As the name SZ-ligature already
says, this special glyph is in fact a ligature (of a so-called "long
s" (written like an 'f' without the crossbar) and a "Z-shaped s" that only
occurred at the end of words). This is a leftover of medieval
calligraphic conventions (which, admittedly, few German speakers would
know about).

> And the orthography change a
> few years back means that some words which were formerly written with
> ß now often aren't in Germany itself (I believe "muss" instead of
> "muß" is a common example).

This is the preferred spelling by now (to unify German/Swiss/Austrian
spelling).

> This is also the case when writing in capital letters as there's no
> capitalised form of ß (so you see "EINBAHNSTRASSE" for "one way
> street").
> 
> > > The transliteration should also really be language dependent - in German
> > > ä -> ae,
> > 
> > That's a typographic convention used in circumstances where Umlaut
> > glyphs aren't available (1970 TELETYPE ....).
> 
> It's true that this is probably less useful than it was back with
> Muscat 3.6 (about 10 years ago).  But it seems a lot of people still
> use their teletypes:

;-) Actually, having grown up with a 7-bit mailer, I still usually type
these transliterations. But most people I know would consider this
old-fashioned. And I know that on web pages 'ae' for 'ä' would
not be acceptable (even though ZEIT is a German-only newspaper, we decided
to store all content in UTF-8. There's an astonishing number of
documents that actually _do_ use characters from outside Latin-1.
Mixing the display of such content with transliterated umlaut glyphs
looks funny).
Just trying to find an English equivalent: technically 'v' and 'u'
are calligraphic variants of the same glyph. The Old English
thorn glyph was often transliterated as a 'p' and the 'th'-ligature
got transliterated as 'y' ... you still wouldn't want your queries
to use these transliterations ;-}

> http://www.google.co.uk/search?q=H%C3%B6hle gives 2940000 hits
> 
> http://www.google.co.uk/search?q=Hoehle gives 267000 hits
> 
> That's about 9%.
> 
> Interestingly, if you look at the results for the first it seems Google
> simply drops the umlaut when matching so that 2940000 includes a number
> of hits for "Hohle".  That's worse than what we do, and also means that
> the 2940000 is probably a slight overestimate.

Yes, they seem to do so. Good example by the way, since 'höhle'
(cave) and 'hohle' (hollow), although etymologically related, denote
different things.

I wouldn't pick Google as my role model for search. The problem with
the umlaut transliteration is that it sometimes creates false matches.
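
To make the two behaviours discussed above concrete, here is a minimal
sketch in Python (standard library only). The function names are made
up for illustration; this is not Xapian's or Google's code, just an
assumption about what "dropping the umlaut" versus "transliterating"
amounts to.

```python
import unicodedata

# Hypothetical German transliteration table (ä -> ae etc.), as in the thread.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def strip_diacritics(word):
    """Drop combining marks entirely, so 'Höhle' becomes 'Hohle'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transliterate_german(word):
    """Apply the German convention, so 'Höhle' becomes 'Hoehle'."""
    # Normalise to NFC first so decomposed input (o + U+0308) is also mapped.
    return unicodedata.normalize("NFC", word).translate(GERMAN_MAP)

if __name__ == "__main__":
    for w in ("Höhle", "hohle", "muß"):
        print(w, strip_diacritics(w), transliterate_german(w))
```

With stripping, 'Höhle' and 'hohle' end up as the same term after case
folding; with the German transliteration they stay distinct, at the
price of matching anything that was genuinely spelled with 'oe'.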

> 
> > Presenting such a conversion to today's (web) users gives a rather
> > archaic touch to the website.
> 
> But this all happens behind the scenes (at least as much as possible).
> The stemmed form is what actually gets searched for, but when listing
> which terms match which documents, Omega maps this back to the forms
> the user actually entered.  Even in $topterms we try hard to avoid
> presenting the stemmed form (though we don't always manage it).

My main problem right now is the terms I get back from 'enquire.get_eset';
that's actually the reason we currently do not display eset terms on the
public search page. I have the feeling that the stemmed terms are
less presentable in the German version than in the English one. To be clear:
I don't have a simple solution for this right now - otherwise I'd just
submit a patch. I do feel that the treatment of accents/umlauts is language
dependent and must be handled by the stemmer (as we currently do on our
public search page). So far no complaints ...
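
One possible way to make stemmed terms presentable is to remember, at
indexing time, which surface forms produced each stem and show the most
frequent one. The sketch below is only an illustration of that idea in
plain Python: 'toy_stem' is a deliberately crude stand-in (not the
Snowball German stemmer), and none of the names are Xapian or Omega
APIs.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase, strip one common suffix."""
    word = word.lower()
    for suffix in ("en", "er", "es", "e", "n", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_display_map(tokens, stem=toy_stem):
    """Map each stem to the surface form seen most often for it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}

def presentable(stems, display_map):
    """Replace stems by their display form; unknown stems pass through."""
    return [display_map.get(s, s) for s in stems]

if __name__ == "__main__":
    corpus = ["Höhlen", "Höhle", "Höhle", "Grafen", "Graf", "Graf"]
    display = build_display_map(corpus)
    print(presentable(["höhl", "graf"], display))
```

The design choice here is to pick the most frequent surface form; the
form the user actually typed would be a better choice when one exists,
which is what Omega is described as doing for query terms.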

> 
> > What was the reason for not using the latest snowball version in Xapian?
> 
> As James said, there have been some minor tweaks to the algorithms since
> the last time we imported a version.  Changing the algorithms makes
> existing databases incompatible so we avoid doing it too often, and
> try to do it in step with other database incompatible changes (and
> not at a point release).

Sorry, I didn't consider the backward-compatibility problem when I asked
this question. It's nice to know that even in a pre-1 version you care
for your installed user base.
BTW, besides all this, I think ZEIT is really happy with Xapian - the
perceived quality of the search results seems much better than that of
the previous search engine at 'http://www.dwds.de'. That engine is
developed and maintained by the 'Digitales Wörterbuch der Deutschen
Sprache' group in Berlin. They _do_ have an excellent
Stemmer/Morpher/Tagger and their software does taxonomic classification
of terms (i.e. you search for 'adel' (nobility) and get results for
'herzog' (duke) and 'graf' (count) as well). Unfortunately they don't have
any relevance ranking ....

Cheers, RalfD
> 
> Cheers,
>     Olly