[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Jim Lynch jim at fayettedigital.com
Sun Feb 26 10:08:20 GMT 2006


All I can say is that it works.  Take a look at 
http://jim.lynch.name/cgi-bin/firelex.cgi and search for 
höchstpersönlichen.  You'll see the word is found whole: neither 
the indexer nor the query parser split it.  To prove that, try 
searching for "nlichen".  If the word were being split at the accented 
character, that search would match, but it doesn't.

Jim.
tata 668 wrote:

> But if the Xapian queryparser doesn't currently support UTF-8, that 
> implies two possibilities:
>
> 1) The indexers from the Omega project don't support UTF-8 either
> or
> 2) The Xapian queryparser and the indexers from Omega don't use the 
> same algorithms to split strings into words!
>
> My problem is still present: I want to be sure the words indexed are 
> separated the same way the words from the query strings will be!
>
> Therefore I guess the best solution for now is to write your own 
> queryparser and your own indexer, both using the SAME algorithm to 
> split words.
>
> If I take that solution, the only problem remaining is to find a 
> bulletproof way to split UTF-8 in PHP.
>
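For what it's worth, here is a minimal sketch of the "same algorithm on both sides" idea from the quoted message. It is in Python rather than PHP, and it is not how Xapian or Omega split words. The names split_words, build_index and search are made up for illustration. The only point is that the indexer and the query side call one shared function, so they cannot disagree about where a word ends. It assumes the input has already been decoded from UTF-8 into a Unicode string.

```python
import re
import unicodedata

# In Python 3, \w on str patterns is Unicode-aware, so "ö" counts as a
# word character (digits and "_" are also matched by \w).
WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Hypothetical shared tokenizer used for both indexing and queries."""
    # NFC-normalise first so that "o" + combining diaeresis becomes a single
    # "ö"; a bare combining mark is not matched by \w and would split the word.
    normalised = unicodedata.normalize("NFC", text)
    return [word.lower() for word in WORD_RE.findall(normalised)]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in split_words(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = split_words(query)  # same splitter as build_index
    if not terms:
        return set()
    results = None
    for term in terms:
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results


docs = {1: "Die höchstpersönlichen Rechte"}
index = build_index(docs)
whole_word_hits = search(index, "höchstpersönlichen")
fragment_hits = search(index, "nlichen")
```

With this splitter the whole word matches document 1 and the fragment "nlichen" matches nothing, which is the same check described at the top of this message. In PHP the rough equivalent would presumably be preg_match_all with the /u modifier, but I have not tested that here.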
