[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sat Feb 25 21:25:52 GMT 2006

But if the Xapian queryparser doesn't currently support UTF-8, that implies two 
possibilities:

1) The indexers from the Omega project don't support UTF-8 either
or
2) The Xapian queryparser and the indexers from Omega don't use the same 
algorithms to split strings into words!

My problem is still present: I want to be sure the indexed words are 
split the same way as the words in the query strings!

Therefore I guess the best solution for now is to write your own queryparser 
and your own indexer, both using the SAME algorithm to split words.

If I take that solution, the only remaining problem is to find a bulletproof 
way to split UTF-8 in PHP.
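
To make the "same algorithm on both sides" idea concrete, here is a minimal 
sketch. It is written in Python rather than PHP purely for illustration, and 
the names split_words, index_terms and query_terms are hypothetical helpers, 
not Xapian or Omega functions. It assumes that a "word" is a run of Unicode 
letters and digits, and that lowercasing is wanted; both are assumptions, not 
something Xapian prescribes.

```python
import re
import unicodedata

# Assumption: a word is a maximal run of Unicode letters/digits.
# [^\W_] means "a \w character that is not an underscore".
_WORD = re.compile(r"[^\W_]+")


def split_words(text):
    """Split text into lowercase words (hypothetical shared splitter)."""
    # Normalize first so that "o" + combining diaeresis becomes "ö"
    # and is not cut apart by the regular expression.
    text = unicodedata.normalize("NFC", text)
    return [word.lower() for word in _WORD.findall(text)]


def index_terms(document_text):
    """Terms an indexer would add to a document, one by one."""
    return split_words(document_text)


def query_terms(querystring):
    """Terms a home-made query parser would search for."""
    return split_words(querystring)
```

Because both sides call the same function, "aaaÏbbb" comes out as one term 
at indexing time and at query time, while a spacing accent such as "´" is 
treated as a separator in both places. A PHP version would need an equivalent 
Unicode-aware splitter; which function is best for that is the open question.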

----- Original Message ----- 
From: "Jim Lynch" <jim at fayettedigital.com>
To: <xapian-discuss at lists.xapian.org>
Sent: Saturday, February 25, 2006 1:30 PM
Subject: Re: [Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

> I'm using a combination of scriptindex and omega to index German-language 
> texts and the words do not split on accented characters.  E.g. 
> höchstpersönlichen remains höchstpersönlichen and a search for it finds it 
> fine.  What does happen is that Xapian transliterates the accented 
> characters into digraphs, but since these are unique it doesn't make any 
> difference unless you want to use the term list that is returned for 
> something.
> Olly posted a patch recently to eliminate that behavior.  While omega is a 
> CGI program, that does not mean you cannot use it to search a database and 
> return results to a program.  In fact, that's the way I'm using it myself. 
> I use html2text to produce plain text, read the text in, and format it 
> the way scriptindex expects.  I then have my search program call 
> omega to return an XML file to me with the results.  I am using it in CGI 
> mode, just because that is convenient, but I could have called it via an 
> exec call just as easily.
> Hope that helps.
>
> Jim.
> tata 668 wrote:
>
>> Hi,
>>
>> It's my first message to this mailing list; I hope I'm sending it to the 
>> correct address. I'm also new to Xapian and my English is not perfect.
>>
>> I test Xapian from PHP 4.4.1, using the bindings, and it works pretty 
>> well. Thanks to everyone involved in this project!
>>
>> My questions:
>>
>> 1) Am I correct when I say that Xapian doesn't provide an indexer 
>> function? I mean, from what I understand, the only way to index a text in 
>> Xapian is to split it, word by word, *by ourselves*, and then to insert 
>> those words, one by one, into Xapian using Document::add_term(). There is 
>> no Xapian function that would take a whole text, split the words by 
>> itself and index them, right? I have to write my own indexer, my own 
>> string splitting function. Is that correct? (And I don't think I want to 
>> look at Omega because I do not index webpages; I'm using Xapian to 
>> index some custom text inside my application, to provide a fast 
>> plain-text search functionality.)
>>
>> 2) My second question is related to the queryparser. I've heard that 
>> UTF-8 support is not yet available in release versions. I'm not a C or 
>> C++ programmer so I'd prefer not to mess with patches ( 
>> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). But 
>> anyway, I don't need full support for my queries so I wrote my own 
>> UTF-8-aware queryparser, even if it's not perfect (see question #3).
>>
>> Here's my question: I don't understand how you can use your own parsing 
>> method for indexing (see question #1) AND use the provided Xapian 
>> queryparser (even if it supported UTF-8)! Am I missing something, or do 
>> both sides (the indexing and the queryparsing) have to use the same 
>> splitting algorithm if you want the results to be correct? If my indexing 
>> algorithm splits "aaaÏbbb" into one word only ("aaaÏbbb") but the Xapian 
>> queryparser doesn't consider "Ï" an alphanumeric character and 
>> therefore splits the string into two words ("aaa" and "bbb"), my search 
>> results won't be correct, right? So I don't see how it is possible to 
>> rely on a provided queryparser if there is no indexing function also 
>> provided that would use the exact same splitting algorithm.
>>
>> 3) If someone has experience with splitting UTF-8 strings into words 
>> using PHP 4, I would be really happy to hear about it. I thought 
>> mb_split("\W", $text); would do the job, but it seems that it considers 
>> some characters alphanumeric (e.g. "´") when, I think, it shouldn't. Any 
>> help?
>>
>>
>> Thanks,
>>
>> Jules Landry
>>
>>
>>
>>
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss