[Xapian-discuss] Problem getting Xapian working with Burmese
Emmanuel Engelhart
emmanuel at engelhart.org
Sun Jan 31 10:31:03 GMT 2010
emmanuel at engelhart.org wrote:
> On Fri, Aug 21, 2009 at 02:44:44PM +0200, emmanuel at engelhart.org wrote:
>>> I want to follow up on my request.
>>> Is my question badly formulated? Too trivial? ... Or maybe pretty
>>> complicated/unclear?
>> I think nobody answered as it was hard to follow your example because
>> the Burmese characters seem to have been mangled (at least the message I
>> received wasn't valid utf-8).
>>
>> But looking at the code, I see an issue:
>>
>>> my $db = Search::Xapian::Database->new( './xapdb' );
>>> my $enq = $db->enquire( $ARGV[0] );
>> What this does is to create an Enquire object and set Query($ARGV[0]) as
>> the query. That works OK if $ARGV[0] is a single word which gets
>> indexed as a single term, but you really want to parse the query string
>> to get a Query object:
>>
>> my $db = Search::Xapian::Database->new( './xapdb' );
>> my $queryparser = Search::Xapian::QueryParser->new();
>> my $query = $queryparser->parse_query( $ARGV[0] );
>> my $enq = $db->enquire( $query );
>>
>> I'd guess that is probably your problem, but I can't tell for sure as I
>> can't test your examples...
>>
>> For further information on debugging this sort of problem, see:
>>
>> http://trac.xapian.org/wiki/FAQ/NoMatches
>>
>
> Hi Olly,
>
> thank you for your answer (and sorry for not having replied sooner).
>
> Your answer helped me and I think I now understand why "it does not work".
>
> For testing purposes, I indexed one document containing the single string "ဝီကီပိသုံးစွဲသူများက" with index_text_without_positions() (C++ API).
> See this log: http://tmp.kiwix.org/tmp/kiwix-index.log (UTF-8 encoded)
>
> But if I run "delve -r 1 /path/to/db" on the index, I get the following output:
> Term List for record #1: test က စ ပ မ ဝ သ (utf8 encoded)
> See the log : http://tmp.kiwix.org/tmp/delve.log
>
> So it seems clear to me why "it does not work": my word is split into single letters, and many characters are removed.
>
> Am I right? Can we avoid that and index "ဝီကီပိသုံးစွဲသူများက" as a single word?
I think I have more or less understood what is wrong.
"ပဲရစ်" is the name of "Paris" in Burmese.
Here is the result of delve -r 1:
Term List for record #1: ပ ရစ
We can see that the diacritics were removed, and I think this is the
issue: the tokenizer treats the diacritics as separators, which should
not happen because they do not stand alone but belong to the preceding
letter.
Maybe something is wrong in Utf8Iterator or in is_wordchar()?
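
To illustrate the hypothesis, here is a minimal sketch in Python using only the standard unicodedata module. It is not Xapian's code: the two is_wordchar_* functions and tokenize() are hypothetical stand-ins that only model "a tokenizer that accepts letters and numbers" versus "a tokenizer that also accepts combining marks". The word for "Paris" is written as explicit code points, assuming it is spelled U+1015 U+1032 U+101B U+1005 U+103A.

```python
import unicodedata

# Burmese word for "Paris", spelled out as code points (assumed spelling)
# so the example does not depend on how this message was transcoded.
PARIS = "\u1015\u1032\u101b\u1005\u103a"


def is_wordchar_letters_only(ch):
    # Hypothetical rule: only letters (L*) and numbers (N*) are word characters.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_wordchar_with_marks(ch):
    # Hypothetical rule: combining marks (M*) also belong to the word.
    return unicodedata.category(ch)[0] in ("L", "N", "M")


def tokenize(text, is_wordchar):
    """Split text into maximal runs of characters accepted by is_wordchar."""
    tokens, current = [], []
    for ch in text:
        if is_wordchar(ch):
            current.append(ch)
        elif current:
            # A rejected character ends the current term and is dropped.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # Show the Unicode general category of every code point in the word.
    for ch in PARIS:
        print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
    print(tokenize(PARIS, is_wordchar_letters_only))
    print(tokenize(PARIS, is_wordchar_with_marks))
```

Under the letters-only rule the word is split at each mark, giving the same two terms that delve reports above. When marks are accepted, the word stays in one piece. Whether Xapian's is_wordchar() really behaves like the first rule is exactly what would need to be checked.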
Regards
Emmanuel