[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sat Feb 25 18:30:39 GMT 2006

I'm using a combination of scriptindex and omega to index German-language
texts, and the words do not split on accented characters.  E.g.
höchstpersönlichen remains höchstpersönlichen and a search for it finds
it fine.  What does happen is that Xapian transliterates the accented
characters into digraphs, but since these are unique it doesn't make
any difference unless you want to use the returned term list for
something.
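
For anyone writing their own splitter instead, here is a minimal sketch of
the idea that indexing and query parsing should share one splitting
function.  It is written in Python rather than PHP 4, it is not what
scriptindex or omega do internally, and the names split_terms and
index_postings are made up for illustration.  It assumes the input is
already decoded Unicode text and that "a word" means a run of letters and
digits.

```python
import re
import unicodedata

# Runs of Unicode letters and digits. [^\W_] means "\w without underscore".
# Assumption: this is the word definition you want; adjust as needed.
WORD_RE = re.compile(r"[^\W_]+")


def split_terms(text):
    """Split text into lowercased terms.

    Call this for BOTH the documents being indexed and the user's query,
    so that the two sides can never disagree about word boundaries.
    """
    # NFC composes "o" + combining diaeresis into a single "ö"; without
    # it, a decomposed accent would break the word in two.
    text = unicodedata.normalize("NFC", text)
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def index_postings(text):
    """Return (term, position) pairs, positions starting at 1.

    These are the values you would pass, one pair at a time, to the
    bindings' Document add_posting() call (or just the term to add_term()).
    """
    return [(term, pos) for pos, term in enumerate(split_terms(text), start=1)]


if __name__ == "__main__":
    print(index_postings("höchstpersönlichen Rechte"))
    print(split_terms("aaaÏbbb"))   # accented letter stays inside the word
    print(split_terms("aaa´bbb"))   # a free-standing accent mark separates
```

The design point is only that one function owns the definition of a word.
Whether to lowercase, stem, or strip accents is a separate choice, as long
as both sides make it the same way.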

Olly posted a patch recently to eliminate that behavior.  Although omega
is a CGI program, that does not mean you cannot use it to search a
database and return results to a program.  In fact, that's the way I'm
using it myself.  I use html2text to produce plain text, then read the
text in and format it the way scriptindex expects.  I then have my
search program call omega to return an XML file to me with the results.
I am using it in CGI mode, just because that is convenient, but I could
have called it via an exec call just as easily.
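
A rough sketch of the two glue steps, in Python and with several
assumptions.  The field names (url, title, text) are hypothetical and must
match whatever your index script declares.  The record layout follows the
scriptindex input format as I understand it: name=value lines, a newline
inside a value continued with a leading "=", and a blank line ending the
record.  The omega URL, the database name and the "xml" template name are
placeholders; P, DB and FMT are omega's usual CGI parameters.  The query is
percent-encoded as UTF-8 here, which may not be what an older omega build
expects.

```python
from urllib.parse import urlencode


def scriptindex_record(fields):
    """Format one record for scriptindex from a dict of field -> text."""
    lines = []
    for name, value in fields.items():
        # Normalise line endings and drop leading/trailing blank lines.
        value = value.replace("\r\n", "\n").strip("\n")
        # A newline inside a value is continued by starting the next
        # line with "=", so a multi-line body stays in one field.
        lines.append(f"{name}={value}".replace("\n", "\n="))
    # The blank line marks the end of the record.
    return "\n".join(lines) + "\n\n"


def omega_url(base, query, db, fmt):
    """Build the URL a search program would fetch from omega in CGI mode."""
    return base + "?" + urlencode({"P": query, "DB": db, "FMT": fmt})


if __name__ == "__main__":
    record = scriptindex_record({
        "url": "/doc/1",                      # hypothetical field names
        "title": "Beispiel",
        "text": "erste Zeile\nzweite Zeile",  # e.g. html2text output
    })
    print(record, end="")
    print(omega_url("http://localhost/cgi-bin/omega",  # placeholder URL
                    "höchstpersönlichen", "default", "xml"))
```

The search program would then fetch that URL (or run omega directly) and
parse whatever the chosen template prints.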

Hope that helps.

Jim.

tata 668 wrote:

> Hi,
>
> It's my first message on this mailing list; I hope I'm sending it to
> the correct address. I'm also new to Xapian and my English is not
> perfect.
>
> I'm testing Xapian from PHP 4.4.1, using the bindings, and it works
> pretty well. Thanks to everyone involved in this project!
>
> My questions:
>
> 1) Am I correct when I say that Xapian doesn't provide an indexer
> function? I mean, from what I understand, the only way to index a text
> in Xapian is to split it, word by word, *by ourselves*, and then to
> insert those words one by one into Xapian using Document::add_term().
> There is no Xapian function that would take a whole text, split the
> words by itself and index them, right? I have to write my own
> indexer, my own string splitting function. Is that correct? (And I
> don't think I want to look at Omega because I do not index webpages;
> I'm using Xapian to index some custom text inside my application, to
> provide a fast plain-text search functionality.)
>
> 2) My second question is related to the queryparser. I've heard that
> UTF-8 support is not yet available in release versions. I'm not a C or
> C++ programmer so I'd prefer not to mess with patches (
> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). But
> anyway, I don't need full support for my queries so I wrote my own
> UTF-8 aware queryparser, even if it's not perfect (see question #3).
>
> Here's my question: I don't understand how you can use your own
> parsing method for indexing (see question #1) AND use the provided
> Xapian queryparser (even if it supported UTF-8)! Am I missing
> something, or do both sides (the indexing and the queryparsing) have
> to use the same splitting algorithm if you want the results to be
> correct? If my indexing algorithm splits "aaaÏbbb" into one word only
> ("aaaÏbbb") but the Xapian queryparser doesn't consider "Ï" an
> alphanumeric character and therefore splits the string into two words
> ("aaa" and "bbb"), my search results won't be correct, right? So I
> don't see how it is possible to rely on a provided queryparser if
> there is no indexing function also provided that would use the exact
> same splitting algorithm.
>
> 3) If someone has experience with splitting UTF-8 strings into words
> using PHP 4, I would be really happy to hear about it. I thought
> mb_split("\W", $text); would do the job but it seems that it considers
> some characters alphanumeric (e.g. "´") where, I think, it shouldn't.
> Any help?
>
>
> Thanks,
>
> Jules Landry