[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sat Feb 25 16:54:51 GMT 2006

Hi,

This is my first message to this mailing list; I hope I'm sending it to the
correct address. I'm also new to Xapian, and my English is not perfect.

I'm testing Xapian from PHP 4.4.1, using the bindings, and it works pretty well.
Thanks to everyone involved in this project!

My questions:

1) Am I correct in saying that Xapian doesn't provide an indexer function?
From what I understand, the only way to index a text in Xapian is to split
it into words *ourselves* and then insert those words one by one using
Document::add_term(). There is no Xapian function that takes a whole text,
splits it into words itself, and indexes them, right? So I have to write my
own indexer, with my own string-splitting function. Is that correct? (I
don't think I want to look at Omega, because I'm not indexing web pages: I'm
using Xapian to index some custom text inside my application, to provide
fast plain-text search.)
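To make question 1 concrete, here is a minimal sketch of such a hand-written indexer. It is in Python rather than PHP, for illustration only. `FakeDocument` is a hypothetical stand-in for a Xapian document (only the `add_term()` name comes from the message above), and lowercasing plus whitespace splitting is just a placeholder for a real word splitter.

```python
class FakeDocument:
    """Hypothetical stand-in for a Xapian document; it only records terms."""

    def __init__(self):
        self.terms = []

    def add_term(self, term):
        # A real document would store the term for indexing;
        # this stand-in simply appends it to a list.
        self.terms.append(term)


def index_text(doc, text):
    """Split text into words by hand and add them to doc one by one."""
    for word in text.lower().split():
        doc.add_term(word)
    return doc


doc = index_text(FakeDocument(), "Fast plain-text search")
print(doc.terms)
```

The point of the sketch is only that the splitting rule lives in application code, so whatever parses the queries has to agree with it.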

2) My second question is about the queryparser. I've heard that UTF-8
support is not yet available in release versions. I'm not a C or C++
programmer, so I'd prefer not to mess with patches (
http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). In any case,
I don't need full query support, so I wrote my own UTF-8-aware
queryparser, even if it's not perfect (see question #3).

Here's my question: I don't understand how you can use your own parsing
method for indexing (see question #1) AND use the queryparser Xapian provides
(even if it supported UTF-8)! Am I missing something, or do both sides
(indexing and query parsing) have to use the same splitting algorithm for the
results to be correct? If my indexing algorithm keeps "aaaÏbbb" as a single
word ("aaaÏbbb"), but the Xapian queryparser doesn't consider "Ï" an
alphanumeric character and therefore splits the string into two words ("aaa"
and "bbb"), my search results won't be correct, right? So I don't see how it
is possible to rely on a provided queryparser unless there is also a provided
indexing function that uses exactly the same splitting algorithm.
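The mismatch described above can be reproduced with two toy splitters. This is a sketch in Python, not PHP, and `split_ascii` is a hypothetical stand-in for a parser that only knows ASCII letters and digits; it is not meant to describe what the Xapian queryparser actually does.

```python
import re


def split_unicode(text):
    # Indexing side: any Unicode letter or digit (and underscore)
    # counts as part of a word.
    return re.findall(r"\w+", text)


def split_ascii(text):
    # Query side (hypothetical): only ASCII letters and digits
    # count as part of a word.
    return re.findall(r"[A-Za-z0-9]+", text)


indexed_terms = set(split_unicode("aaaÏbbb"))
query_terms = split_ascii("aaaÏbbb")

# Query terms that actually exist in the index.
matches = [term for term in query_terms if term in indexed_terms]

print(indexed_terms)  # the single term "aaaÏbbb"
print(query_terms)    # the two terms "aaa" and "bbb"
print(matches)        # empty: neither query term was indexed
```

With the same splitter on both sides, the query would produce the single term "aaaÏbbb" and find it.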

3) If someone has experience with splitting UTF-8 strings into words in
PHP 4, I would be really happy to hear about it. I thought
mb_split("\W", $text); would do the job, but it seems to treat some
characters as alphanumeric (e.g. "´") when, I think, it shouldn't. Any help?

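As an illustration of the splitting behaviour asked for in question 3, here is a minimal sketch in Python (not PHP 4, so it does not answer the mb_split() question itself). It assumes the text has already been decoded from UTF-8, and it assumes that "word character" means a Unicode letter, combining mark, or number; the helper names are hypothetical.

```python
import unicodedata


def is_word_char(ch):
    # Assumption: letters (L*), combining marks (M*) and numbers (N*)
    # belong to words; everything else separates them.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def split_words(text):
    """Split decoded text into words at every non-word character."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("aaaÏbbb l´été"))
```

Under this rule "Ï" stays inside its word, while "´" (Unicode category Sk, a modifier symbol) acts as a separator.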
Thanks,

Jules Landry