[Xapian-discuss] Input files and special chars and spaces

Olly Betts olly at survex.com
Wed Sep 7 00:01:24 BST 2005


On Mon, Sep 05, 2005 at 07:35:13PM +0200, Floris Bos wrote:
> I'm using scriptindex input 
> files to put data in the Xapian db. This works great as long as I don't 
> use any special characters. As soon as I try to add a document that 
> includes a text with special characters like for example: ' / or \
> only the part before the special char is added to the db.

A word such as "doesn't" is currently indexed by scriptindex as
"doesn" and "t".  Assuming you index with positional information, when
searching, "doesn't" is treated as a phrase search so will match as you
want.  This approach also allows "Olly's" to be matched by a search for
"olly".

Similarly, "/etc/passwd" is indexed as "etc" and "passwd", and searching
for it generates a phrase search.  But you can also search for "etc" and
"passwd" separately and the document will match.
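
To make the above concrete, here is a rough Python sketch of the idea. It
is not Xapian's or scriptindex's actual code: the function names are made
up for illustration, it only handles ASCII letters and digits, and it does
no stemming.  It just shows how treating special characters as word breaks
plus storing positions lets a phrase search recover the original text.

```python
import re


def tokenize(text):
    """Split text into lowercase terms; any non-alphanumeric char is a word break."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_positions(text):
    """Map each term to the list of word positions at which it occurs."""
    positions = {}
    for pos, term in enumerate(tokenize(text)):
        positions.setdefault(term, []).append(pos)
    return positions


def phrase_match(positions, query):
    """Return True if the query's terms occur consecutively in the indexed text."""
    terms = tokenize(query)
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


# Hypothetical document text, indexed with positional information.
index = build_positions("Olly's script doesn't read /etc/passwd")
```

With this index, phrase_match(index, "doesn't") and
phrase_match(index, "/etc/passwd") both succeed because the pieces sit at
adjacent positions, and a plain search for "olly" or "passwd" matches too.
Without the positions there would be nothing to check adjacency against.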

The downside of this is that some of the phrase searches we generate in
this way can be rather slow with a big database, so this is an area
which is likely to be revisited:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=22

Also, the latest Snowball stemmers can make use of apostrophes (at least
in English).

> I read somewhere (can't recall where) that Omega replaces these chars
> with spaces when using regular indexing.

I'm not sure "replacing with spaces" is the best way to think about it.
Such characters are simply treated as word breaks, like spaces are.

> Do I also need to do this replacing when using input files?

No.

> I know that this doesn't influence searching but I'd like to have 
> the possibility to use at least the ' char in the sample field because 
> this is a char that often occurs in the Dutch language.

But you can search for text containing "'" (provided you're storing
positional information).

> Is it possible to create a sample text in the Xapian db that includes 
> special chars?

Are you asking about the sample text stored in the document data and
used in the search results?  That can contain any character (even zero
bytes).

Cheers,
    Olly


