[Xapian-discuss] std::string arguments presumed to be UTF8?

Olly Betts olly at survex.com
Tue Nov 15 02:20:32 GMT 2011


On Mon, Nov 14, 2011 at 12:20:11PM +0000, James Aylett wrote:
>  * std::string should never be presumed to be UTF8. Terms, for
>  instance, are just treated internally as byte arrays (but are
>  commonly used to store strings, hence using std::string for
>  convenience in C++).
> 
>  * The TermGenerator, and a few other pieces of Xapian, *do* act on
>  UTF8, since they operate at a level that is dealing with actual
>  characters, so there has to be a defined encoding.

Yes, that's spot on - if we need to look at characters, then the
encoding matters and should be UTF-8.  Otherwise you can put any byte
sequences in the input.
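
As a rough illustration of the distinction (a standalone sketch, not
Xapian code; is_valid_utf8 is a hypothetical helper, and nothing here
claims to mirror how Xapian itself checks or decodes input): a
std::string can carry any bytes, including NUL and 0xFF, and only code
that has to interpret characters needs to care whether those bytes are
well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xapian: returns true if `s` is
// well-formed UTF-8 (rejects overlong forms, surrogates, and code
// points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

So std::string("\xff\x00\x01", 3) is a perfectly good three-byte
std::string, but is_valid_utf8() returns false for it, whereas it
returns true for "caf\xc3\xa9".  Data of the first kind is fine
wherever the bytes are just passed through; only the character-aware
parts need the second kind.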

> Unfortunately, this isn't terribly clear from the documentation. 

There's already a note about improving that at:
http://trac.xapian.org/wiki/MissingDocumentation
(It may look as though that list only ever gets longer, but we are
addressing things on it; more just keep getting added too...)

Cheers,
    Olly


