[Xapian-discuss] Problem getting Xapian working with Burmese

Sun Jan 31 10:31:03 GMT 2010

emmanuel at engelhart.org wrote:
>  On Fri, Aug 21, 2009 at 02:44:44PM +0200, emmanuel at engelhart.org wrote:
>>> I want to follow up on my request.
>>> Is my question badly formulated? Too trivial? ... or maybe rather
>>> complicated/unclear?
>> I think nobody answered as it was hard to follow your example because
>> the Burmese characters seem to have been mangled (at least the message I
>> received wasn't valid utf-8).
>>
>> But looking at the code, I see an issue:
>>
>>> my $db = Search::Xapian::Database->new( './xapdb' );
>>> my $enq = $db->enquire( $ARGV[0] );
>> What this does is to create an Enquire object and set Query($ARGV[0]) as
>> the query.  That works OK if $ARGV[0] is a single word which gets
>> indexed as a single term, but you really want to parse the query string
>> to get a Query object:
>>
>>    my $db = Search::Xapian::Database->new( './xapdb' );
>>    my $queryparser = Search::Xapian::QueryParser->new();
>>    my $query = $queryparser->parse_query( $ARGV[0] );
>>    my $enq = $db->enquire( $query );
>>
>> I'd guess that is probably your problem, but I can't tell for sure as I
>> can't test your examples...
>>
>> For further information on debugging this sort of problem, see:
>>
>> http://trac.xapian.org/wiki/FAQ/NoMatches
>>
> 
> Hi Olly,
> 
> Thanks for your answer (and sorry for not having replied sooner).
> 
> Your answer helped me, and I think I now understand why "it does not work".
> 
> For testing purposes I indexed one document containing only the string "ဝီကီပိသုံးစွဲသူများက", using index_text_without_positions() (C++ API).
> See this log: http://tmp.kiwix.org/tmp/kiwix-index.log (utf8 encoded)
> 
> But if I run "delve -r 1 /path/to/db" on the index, I get the following output:
> Term List for record #1: test က စ ပ မ ဝ သ  (utf8 encoded)
> See the log : http://tmp.kiwix.org/tmp/delve.log
> 
> So it seems clear to me why "it does not work": my word is split into single letters and many of the letters are removed.
> 
> Am I right? Can we avoid that and index "ဝီကီပိသုံးစွဲသူများက" as a single word?

I think I have more or less understood what is wrong.

"ပဲရစ်" is the name of "Paris" in Burmese.

Here is the output of delve -r 1 after indexing it:
Term List for record #1: ပ ရစ

We can see that the diacritics were removed, and I think this is the
issue: the tokenizer treats the diacritics as separators, which should
not be the case, because they never stand alone but belong to a
letter.

Maybe something is wrong in Utf8Iterator or in is_wordchar()?
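
To illustrate the suspicion, here is a minimal Python sketch. It is not
Xapian's code: tokenize() and the two is_wordchar_* helpers are
hypothetical names, and the assumption is that the word-character test
accepts only letters and digits (Unicode general categories L* and N*).
The Burmese diacritics are combining marks (category M*), so under that
assumption they end a word exactly like punctuation would.

```python
import unicodedata


def is_wordchar_letters_only(ch):
    # Assumed behaviour: only letters (L*) and numbers (N*) are word characters.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_wordchar_with_marks(ch):
    # Variant: combining marks (M*) also count as part of a word.
    return unicodedata.category(ch)[0] in ("L", "N", "M")


def tokenize(text, is_wordchar):
    # Split text into maximal runs of word characters; everything else separates.
    terms, current = [], []
    for ch in text:
        if is_wordchar(ch):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


# "Paris" in Burmese, spelled out as code points:
# U+1015 PA, U+1032 VOWEL SIGN AI, U+101B RA, U+1005 CA, U+103A ASAT
paris = "\u1015\u1032\u101b\u1005\u103a"

categories = [unicodedata.category(ch) for ch in paris]
split_terms = tokenize(paris, is_wordchar_letters_only)
whole_terms = tokenize(paris, is_wordchar_with_marks)

print(categories)
print(split_terms)
print(whole_terms)
```

With the letters-only test the word falls apart into the same two
fragments that delve shows above; once combining marks are accepted as
word characters it stays a single term. Whether this is what really
happens inside Xapian would have to be checked against is_wordchar()
itself.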

Regards
Emmanuel