capitalization vs stemming: missing quest results

Peter Marquardt marquardt_p at molgen.mpg.de
Tue Apr 4 12:54:33 BST 2017


Hi,

since trac doesn't respond to my subscription request, I'll try this way.

TLDR: echo "krankenkassen" > i/input.txt; omindex i/ --stemmer=german; 
quest "krankenkassen" -> not found. see 
https://gist.github.com/anonymous/609f82a065f3d0ac6b1d077073be286f for 
full script & output

LONG:

- put the lower-case word "krankenkassen" into a text file
- index that file with omindex, using --stemmer=german
- search for the words "Krankenkassen" and "krankenkassen" with quest

result: "Krankenkassen" is found, "krankenkassen" isn't.

------------------ rrrrrrrrrrrrrrrip -----------------
#!/bin/bash -x

# directory with txt files
INPUT=./testinput

LOWER=krankenkassen
UPPER=Krankenkassen

mkdir ${INPUT}

# store one lowercase word
echo ${LOWER} > ${INPUT}/lower.txt

# who am i
omindex --version
quest --version

# clean up database 8)
rm -rf testdb

# create omega index, url doesn't matter
omindex --verbose --db=testdb --url=/bla ${INPUT}

# query database for word in Upper and lower case
quest --db=testdb ${UPPER} | tee test-nostem.out
quest --db=testdb ${LOWER} | tee -a test-nostem.out

# should have been fine.


# now ... clean up the database 8)
rm -rf testdb

# create omega index, use german stemmer
omindex --verbose --db=testdb --url=/bla --stemmer=german ${INPUT}


# try again and query database for word in Upper and lower case
quest --db=testdb ${UPPER} | tee test-stem.out
quest --db=testdb ${LOWER} | tee -a test-stem.out

# the lower-case query finds nothing here, which is weird.

diff test-nostem.out test-stem.out

------------------ rrrrrrrrrrrrrrrap -----------------


and the resulting output:

------------------ rrrrrrrrrrrrrrrip -----------------
+ INPUT=./testinput
+ LOWER=krankenkassen
+ UPPER=Krankenkassen
+ mkdir ./testinput
mkdir: cannot create directory ‘./testinput’: File exists
+ echo krankenkassen
+ omindex --version
omindex - xapian-omega 1.4.3
+ quest --version
quest - xapian-core 1.4.3
+ rm -rf testdb
+ omindex --verbose --db=testdb --url=/bla ./testinput
[Entering directory ""]
Indexing "lower.txt" as text/plain ... added
+ quest --db=testdb Krankenkassen
+ tee test-nostem.out
Parsed Query: Query(krankenkassen@1)
MSet:
1: [0.154151]
url=/bla/lower.txt
sample=krankenkassen
type=text/plain
modtime=1491300443
size=14
+ quest --db=testdb krankenkassen
+ tee -a test-nostem.out
Parsed Query: Query(Zkrankenkassen@1)
MSet:
1: [0.154151]
url=/bla/lower.txt
sample=krankenkassen
type=text/plain
modtime=1491300443
size=14
+ rm -rf testdb
+ omindex --verbose --db=testdb --url=/bla --stemmer=german ./testinput
[Entering directory ""]
Indexing "lower.txt" as text/plain ... added
+ quest --db=testdb Krankenkassen
+ tee test-stem.out
Parsed Query: Query(krankenkassen@1)
MSet:
1: [0.154151]
url=/bla/lower.txt
sample=krankenkassen
type=text/plain
modtime=1491300443
size=14
+ quest --db=testdb krankenkassen
+ tee -a test-stem.out
Parsed Query: Query(Zkrankenkassen@1)
MSet:
+ diff test-nostem.out test-stem.out
11,16d10
< 1: [0.154151]
< url=/bla/lower.txt
< sample=krankenkassen
< type=text/plain
< modtime=1491300443
< size=14
------------------ rrrrrrrrrrrrrrrap -----------------

and for completeness:

# xapian-delve -a testdb

All terms in database: D20170404 Etxt Flower I* J/bla M201704 Owwwutz 
P/bla Ttext/plain U/bla/lower.txt Y2017 ZFlow Zkrankenkass krankenkassen
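
To make the mismatch visible, here is a small sketch that checks the terms from the two parsed queries against the term list above. The term list is copied from the xapian-delve output; the helper name has_term is made up for this illustration and is not part of any Xapian tool.

```shell
# Term list copied from the xapian-delve output above.
terms='D20170404 Etxt Flower I* J/bla M201704 Owwwutz P/bla Ttext/plain U/bla/lower.txt Y2017 ZFlow Zkrankenkass krankenkassen'

# Hypothetical helper: report whether a term occurs as a whole word in the list.
has_term() {
    case " ${terms} " in
        *" $1 "*) echo "$1: indexed" ;;
        *)        echo "$1: missing" ;;
    esac
}

has_term krankenkassen    # term produced by the capitalised query
has_term Zkrankenkassen   # term produced by the lower-case query
has_term Zkrankenkass     # stemmed term that omindex actually stored
```

The lower-case query looks for Zkrankenkassen, but the index only contains Zkrankenkass. If quest stems the query with a different language than the one the index was built with, then running quest --db=testdb --stemmer=german krankenkassen would be the thing to try; that is an assumption, not something the run above shows.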

Funnily enough, the singular ("Krankenkasse") works fine 8)

I'm a complete xapian noob, so what am I doing wrong?

cheers,

	Peter Marquardt
