[Xapian-discuss] omindex character sets

Olly Betts olly at survex.com
Thu Feb 7 02:09:32 GMT 2008


On Wed, Feb 06, 2008 at 09:52:19AM +0000, Homer wrote:
> I compiled xapian / omega on a windows box.
> the omindex did not work for me, because the indexing seemed to hang, while
> indexing text / html.

Did you build with mingw or MSVC?

Can you find out where it hangs by attaching a debugger, or running
omindex under a debugger?

> Now the problem:
> i was indexing some documents with german special characters inside the document
> and inside the path to the document.
> when i use the omega search cgi, some special characters in the path to the
> document are screwed up.
> i think the path is encoded using iso-8859-1 and the main content is encoded
> using utf-8.
> 
> is this true, or am i just doing some beginners mistakes?
> would be nice if someone can tell me how to fix this.

It's a bug, though it's not totally clear to me how best to fix this.

So the URL we use as a link just wants to have top-bit-set characters
"% encoded" (as I believe they already are at display time).
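
To illustrate what "% encoding" the top-bit-set characters means, here is
a minimal Python sketch.  It is not omega's actual code, and the function
name percent_encode_high_bytes is made up for this example.  It only
escapes bytes >= 0x80 and leaves every ASCII byte alone, so a real URL
encoder would also need to handle spaces and reserved characters.

```python
def percent_encode_high_bytes(raw: bytes) -> str:
    """Percent-encode every byte with the top bit set (>= 0x80).

    Hypothetical helper for illustration; ASCII bytes pass through unchanged.
    """
    out = []
    for b in raw:
        if b >= 0x80:
            # Two uppercase hex digits, e.g. 0xFC -> "%FC"
            out.append("%{:02X}".format(b))
        else:
            out.append(chr(b))
    return "".join(out)


# The same filename gives different URLs depending on how it was encoded.
utf8_url = percent_encode_high_bytes("Müller.txt".encode("utf-8"))
latin1_url = percent_encode_high_bytes("Müller.txt".encode("iso-8859-1"))
```

Note that the link works either way, since the escaped bytes are exactly
the bytes of the filename.  The encoding only matters once we want to
show those bytes to the user as characters.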

We could just display the URL to the user the same way.  That's a bit
ugly, but it is actually the URL that is being used so it's "honest"
at least.  It would perhaps be nicer to show the URL with these
characters "decoded" though.  It certainly would when the document
doesn't have a title and we use the URL for the title.

I don't know how MS Windows or Cygwin handles the encoding of filenames.
On Linux you can use the locale as a hint, but that may be incorrect.
In fact different files in the same directory can have different
encodings.

We can tell ISO-8859-1 and UTF-8 apart fairly reliably, by assuming
UTF-8 unless the filename isn't valid UTF-8 in which case we assume
ISO-8859-1.  That works well in practice as the ISO-8859-1 strings it
misinterprets aren't those you'd usually actually encounter.
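
The heuristic can be sketched in a few lines of Python.  Again this is
only an illustration under my own assumptions, not code from omega, and
decode_filename is a hypothetical name.

```python
def decode_filename(raw: bytes) -> str:
    """Guess the encoding of a filename: UTF-8 if valid, else ISO-8859-1.

    Hypothetical helper for illustration.
    """
    try:
        # Strict decoding raises if the bytes are not well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


from_utf8 = decode_filename("Müller.txt".encode("utf-8"))
from_latin1 = decode_filename("Müller.txt".encode("iso-8859-1"))
# An ISO-8859-1 string that happens to be valid UTF-8 is misread:
misread = decode_filename("Ã¼".encode("iso-8859-1"))
```

Both encodings of "Müller.txt" come back as the same string.  The last
case shows the kind of input that gets misinterpreted: the ISO-8859-1
text "Ã¼" has the same bytes as UTF-8 "ü", so it is decoded as "ü".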

So perhaps the best answer is to have an OmegaScript command which
performs this transformation.

Cheers,
    Olly


