[Xapian-discuss] encoding?

Olly Betts olly at survex.com
Sat Apr 1 00:46:32 BST 2006

Please don't send essentially the same message to the list multiple
times (less than 90 minutes apart, too!).  And don't cc: individual
developers - we all read the list, so you'll just annoy people and be
less likely to get a useful answer.  Overall, remember this mailing list
is a free resource, and nobody is under any obligation to help you.  So
if you want help, play nicely and respect the other list members.

On Fri, Mar 31, 2006 at 06:28:37PM +0530, Gupteshwar Joshi wrote:
> Does Omega support different kinds of encodings for searching the
> indexed data?

Currently Omega doesn't perform any character encoding conversions.
So if you're trying to handle a non-Latin language, you'll probably
be disappointed.

> I have indexed all of the documents, which are in English and
> Devanagari.

Sorry, I don't know what encoding Devanagari requires.

> It runs without reporting any error, but even though my local data is
> indexed too, it does not show any result for a Devanagari keyword.

Assuming Devanagari uses a non-Latin character set, the word tokeniser
won't tokenise Devanagari words correctly (or at all, in fact).
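
To illustrate the effect, here is a rough Python sketch.  It is not
Omega's tokeniser: ascii_tokenise, is_word_char and unicode_tokenise are
hypothetical names, and the rule "letters, combining marks and numbers
are word characters" is an assumption made for the example.  It only
shows how a tokeniser that knows about ASCII alone silently drops a
Devanagari word, while one driven by Unicode character categories keeps
it.

```python
import re
import unicodedata

# A Devanagari word (KHA + vowel sign O + JA), written with escapes so
# that this example stays plain ASCII.
DEVANAGARI_WORD = "\u0916\u094b\u091c"


def ascii_tokenise(text):
    """Treat only ASCII letters and digits as word characters."""
    return re.findall(r"[A-Za-z0-9]+", text)


def is_word_char(ch):
    """Assumed rule: Unicode letters (L*), marks (M*) and numbers (N*)."""
    return unicodedata.category(ch)[0] in "LMN"


def unicode_tokenise(text):
    """Split on anything that is not a word character by the rule above."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "xapian " + DEVANAGARI_WORD + " search"
print(ascii_tokenise(text))    # the Devanagari word is missing
print(unicode_tokenise(text))  # all three words are kept
```

Combining marks are included because Devanagari vowel signs are marks
rather than letters; a tokeniser that only accepted letters would cut
the word in the middle.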

The plan for Xapian 1.0 is to fix Omega to convert everything to UTF-8
and use Unicode definitions of what is a word character, etc.  Then this
should all work.
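
If you want to do the conversion step yourself before indexing, it
could look something like the Python sketch below.  This is only an
illustration, not Omega code: to_utf8 is a hypothetical helper, and it
assumes you already know which encoding your source data is in.

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "cafe" with an acute accent on the e, stored as ISO-8859-1 bytes.
latin1_bytes = b"caf\xe9"
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
print(utf8_bytes)
```

Queries need the same treatment as the indexed text, otherwise the
terms in the query won't match the terms in the database.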

Meanwhile, if you're prepared to write your own indexer (or at least
your own word tokeniser), then there's a patch to make the QueryParser
UTF-8 aware (which is what the Gmane search uses).

> I have added a meta tag for the encoding type in the head of the query
> template, but it still doesn't search for those keywords.

Well, that only tells the browser what character set the output is in so
it's not going to affect the searching.

Incidentally, a slightly better approach than a meta tag is to set the
charset in the Content-Type: header of the response by adding something
like this to the top of the query template:

$httpheader{Content-Type,text/html; charset=utf-8}


