[Xapian-discuss] Xapian 1.0.0 released!
Olly Betts
olly at survex.com
Fri May 18 12:42:18 BST 2007
On Fri, May 18, 2007 at 09:53:21AM +0200, Arjen van der Meijden wrote:
> But I'm not sure if I understand the necessary changes needed for the
> change to UTF-8. As you know we generate a set of datafiles (if any
> encoding applies its probably ISO-8859-15) for scriptindex and
> afterwards omega to search through the data.
Well, if it's text, some encoding must apply. I guess you mean "other
than ISO-8859-1"?
> Is there any way to let scriptindex know we're feeding it ISO-8859-15
> rather than UTF-8?
Not yet. It would obviously be a useful feature, and not too hard to
implement, but it was an obvious one to leave out from the initial
release.
In fact, while any UTF-8 string is trivially a valid ISO-8859-1 string,
"real world" ISO-8859-1 doesn't look like valid UTF-8, and our UTF-8
handling code deals with invalid and overlong sequences by assuming
they're really ISO-8859-1, so you can probably just feed in ISO-8859-1
and it will be magically converted to UTF-8 as it is indexed. This hasn't
been tested much though, so test carefully before deploying.
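To make the fallback concrete, here is a rough sketch in Python of the idea described above: decode as UTF-8, and treat any byte that isn't part of a valid sequence as an ISO-8859-1 character. This is not Xapian's code; the handler name `latin1_fallback` and the function `decode_lenient` are made up for illustration, and the real implementation may differ in edge cases.

```python
import codecs


def _latin1_fallback(exc):
    """Error handler: reinterpret undecodable bytes as ISO-8859-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Every byte value maps to a code point in ISO-8859-1, so this cannot fail.
    return bad.decode("iso-8859-1"), exc.end


# Hypothetical handler name, registered for use with bytes.decode().
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid bytes."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    print(decode_lenient(b"caf\xc3\xa9"))  # valid UTF-8 input
    print(decode_lenient(b"caf\xe9"))      # ISO-8859-1 input
```

Both calls yield the same text, which is why feeding in ISO-8859-1 can appear to "just work". Text that happens to contain a byte pair forming a valid UTF-8 sequence would be taken as UTF-8, so the result isn't guaranteed for arbitrary input.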
> The UTF-8 support in normal php installations isn't very good.
No, though the PHP iconv() function should be able to convert to/from
UTF-8.
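If the data files really are ISO-8859-15, the safest route for now is probably to convert them to UTF-8 before they reach scriptindex. A minimal sketch in Python follows (the same idea applies with PHP's iconv()); the function names and file paths are placeholders for illustration, not part of any Xapian tool.

```python
def latin9_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-15 bytes as UTF-8."""
    return data.decode("iso-8859-15").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole data file; the paths are supplied by the caller."""
    with open(src_path, "rb") as src:
        converted = latin9_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


if __name__ == "__main__":
    # Byte 0xA4 is the euro sign in ISO-8859-15.
    print(latin9_to_utf8(b"\xa4 100").decode("utf-8"))
```

Converting explicitly matters for the handful of characters where ISO-8859-15 differs from ISO-8859-1: byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency sign in ISO-8859-1, so an ISO-8859-1 fallback would index it as the wrong character.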
> And is it also possible to let Omega know we are feeding it
> ISO-8859-15 and want that returned as well?
>
> Or are we required to supply those commands with UTF-8 data?
At the moment you'll always get UTF-8 out of Omega.
Cheers,
Olly