[Xapian-devel] QueryParser and utf-8 strings

Radovan Garabik garabik at kassiopeia.juls.savba.sk
Fri Dec 9 10:50:35 GMT 2005


Hi all,
I am rather new to Xapian and only recently tried to include it in
my application, so bear with me if this has already been discussed.

I was playing with QueryParser and noticed that it expects
input to be in ISO-8859-1 encoding: bytes of 0x80 and above are
either transliterated or treated as non-letters. For example,
passing the single word "bože" (UTF-8 encoded) to
parse_query produces a query something like:
Xapian::Query((boaa:(pos=1) OR e:(pos=2)))
which makes parse_query quite unusable for UTF-8 strings (or
indeed, for any encoding other than ISO-8859-1).
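
To illustrate where the split comes from, here is a minimal standalone sketch (not Xapian code). In UTF-8, "ž" is the two bytes 0xC5 0xBE; read as ISO-8859-1 these are "Å" (a letter, which presumably gets transliterated to "aa") and "¾" (not a letter, so it ends the term). The names latin1_is_letter and split_terms_latin1 are hypothetical, and the letter test is a simplified assumption rather than Xapian's actual table.

```cpp
#include <string>
#include <vector>

// Assumption: simplified ISO-8859-1 letter test, not Xapian's real is_tab.
// In ISO-8859-1, 0xC0..0xFF are accented letters, except 0xD7 (multiplication
// sign) and 0xF7 (division sign). Bytes 0x80..0xBF are treated as non-letters.
inline bool latin1_is_letter(unsigned char ch) {
    if (ch < 0x80)
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    return ch >= 0xC0 && ch != 0xD7 && ch != 0xF7;
}

// Hypothetical helper: split a byte string into maximal runs of "letters".
inline std::vector<std::string> split_terms_latin1(const std::string& input) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char ch : input) {
        if (latin1_is_letter(ch)) {
            current += static_cast<char>(ch);
        } else if (!current.empty()) {
            // A non-letter byte closes the current term.
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        terms.push_back(current);
    return terms;
}
```

Calling split_terms_latin1("bo\xC5\xBE" "e") gives two terms, "bo\xC5" and "e", which matches the shape of the query above once 0xC5 is transliterated.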

I tried to disable the transliteration in 
accentnormalisingitor.h and modified common/utils.h to contain:

// ASCII letters, plus every byte >= 0x80 (assumed to be part of a letter).
inline bool C_isalpha(char ch) {
    using namespace Xapian::Internal;
    const unsigned char uch = static_cast<unsigned char>(ch);
    return uch >= 0x80 || (is_tab[uch] & (IS_UPPER|IS_LOWER));
}

// As above, but ASCII digits are accepted as well.
inline bool C_isalnum(char ch) {
    using namespace Xapian::Internal;
    const unsigned char uch = static_cast<unsigned char>(ch);
    return uch >= 0x80 || (is_tab[uch] & (IS_UPPER|IS_LOWER|IS_DIGIT));
}

The reasoning is that most byte values of 0x80 and above are meant as
letters, with very few exceptions (non-breaking spaces and some
punctuation), and people generally do not write queries using those.
Of course, the same effect can be achieved by modifying is_tab.
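
For anyone who wants to try the idea outside the Xapian tree, here is a minimal standalone sketch. It uses plain ASCII range checks as a stand-in for is_tab, and the names byte_is_wordchar and split_terms_bytes are hypothetical, not part of Xapian.

```cpp
#include <string>
#include <vector>

// Stand-in for the patched C_isalnum: ASCII letters and digits,
// plus every byte >= 0x80 (assumed to belong to a multi-byte letter).
inline bool byte_is_wordchar(unsigned char ch) {
    return ch >= 0x80 ||
           (ch >= 'A' && ch <= 'Z') ||
           (ch >= 'a' && ch <= 'z') ||
           (ch >= '0' && ch <= '9');
}

// Hypothetical helper: split a byte string into maximal runs of word bytes.
inline std::vector<std::string> split_terms_bytes(const std::string& input) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char ch : input) {
        if (byte_is_wordchar(ch)) {
            current += static_cast<char>(ch);
        } else if (!current.empty()) {
            // An ASCII non-word byte (space, punctuation) closes the term.
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        terms.push_back(current);
    return terms;
}
```

With this rule, split_terms_bytes("bo\xC5\xBE" "e") returns the single term "bože" with its UTF-8 bytes intact. The cost is the exception mentioned above: a UTF-8 non-breaking space (0xC2 0xA0) between two words would glue them into one term.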

Now queries in my application work as expected :-)

I would suggest making transliteration optional (or, failing that,
removing it, since it does more harm than good), and treating
all bytes of 0x80 and above as letters. I see no better solution
short of implementing full Unicode support, and THAT is
probably not worth the effort.

What do you say?

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!



