[Xapian-discuss] Perl binding: crash & missing functions?

Olly Betts olly@survex.com
Wed, 5 May 2004 00:55:26 +0100


On Tue, May 04, 2004 at 09:58:17PM +0200, Sander Pilon wrote:
> Could it be unicode-related? (The documents I'm trying to index could
> contain unicode (UTF-8))

Xapian doesn't really care what is in terms or data (except for the
stemmers of course).  It's 8-bit clean, and should also be zero byte
clean, except that zero bytes take up extra room in the internal storage
scheme for terms, so a term with zero bytes can't be as long as one
without.

> Are there certain terms Xapian doesn't like?

There's a limit on the term length - it's slightly over 240 bytes (I don't
recall the exact value offhand).  Each zero byte counts double, so a term
of all zero bytes can't be more than just over 120 bytes.
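
As a rough illustration of that counting rule, here is a minimal C++
sketch.  The names term_cost, term_probably_fits and the constant 240
are made up for this example (the real limit is "slightly over 240"),
and none of this is Xapian API:

```cpp
#include <cstddef>
#include <string>

// Assumed limit, for illustration only: the real value is slightly
// over 240 bytes and is not stated exactly in this post.
const std::size_t ASSUMED_MAX_TERM_COST = 240;

// Length of the term in bytes, with each zero byte counted twice.
std::size_t term_cost(const std::string& term) {
    std::size_t cost = term.size();
    for (char ch : term) {
        if (ch == '\0') ++cost;  // a zero byte takes up extra room
    }
    return cost;
}

// Conservative pre-check an indexer could run before adding a term.
bool term_probably_fits(const std::string& term) {
    return term_cost(term) <= ASSUMED_MAX_TERM_COST;
}
```

With these assumptions a 240 byte term of ordinary characters passes,
while a term of 121 zero bytes costs 242 and is rejected.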

The limit actually comes from the keys of the posting list B-tree
inside the quartz backend - for a common term, the list is split into
chunks, and these are keyed on the termname and first document id
in the chunk.

There's currently an odd effect where the exact length limit depends on
the encoded length of this document id (this should really be fixed by
enforcing a standard limit rather than letting the Btree catch it).
Perhaps that's what you're hitting, and why running the indexer multiple
times avoids the problem (because the documents are added in a different
order).
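
To show why the order of addition could matter, here is a small C++
sketch.  The 7-bits-per-byte docid encoding and the 246 byte key limit
used below are assumptions picked for illustration, not quartz's
actual format or limit, and the zero byte doubling is ignored:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical variable-length encoding: 7 bits of the document id
// per byte.  Quartz's real key encoding may well differ.
std::size_t encoded_docid_length(std::uint32_t docid) {
    std::size_t n = 1;
    while (docid >= 0x80) {
        docid >>= 7;
        ++n;
    }
    return n;
}

// A posting list chunk is keyed on the termname plus the first
// document id in the chunk, so the key grows with the encoded id.
std::size_t chunk_key_length(const std::string& term,
                             std::uint32_t first_docid) {
    return term.size() + encoded_docid_length(first_docid);
}

// max_key_len is a parameter because the exact limit isn't known here.
bool key_fits(const std::string& term, std::uint32_t first_docid,
              std::size_t max_key_len) {
    return chunk_key_length(term, first_docid) <= max_key_len;
}
```

Under these assumptions a 245 byte term fits in a chunk starting at
document 100 (one byte id, 246 byte key) but not in one starting at
document 100000 (three byte id, 248 byte key), so the same term can
succeed or fail depending on which document id starts the chunk.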

You're limiting terms with positional info to 64 characters, so only URL
terms can be longer than 240-ish bytes.  I suspect you've got a common
URL which has length between 240 and 250 characters.  Change the URL
length check to "> 240" instead of "> 512" and all should be well.

If you want to index longer terms, look at the technique used in
omega's omindex.cc where the tail of the URL is hashed.
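
The general idea can be sketched in C++ like this.  It is not a copy
of omindex.cc: the FNV-1a hash, the 8 hex digit suffix and the names
fnv1a and shorten_term are stand-ins chosen for this example, and
omega's own hash function and layout may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// 32-bit FNV-1a, used here only as a simple stand-in hash.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char ch : s) {
        h ^= ch;
        h *= 16777619u;
    }
    return h;
}

// Return the term unchanged if it fits in max_len bytes.  Otherwise
// keep the first (max_len - 8) bytes and replace the rest with 8 hex
// digits of a hash of the discarded tail, so the result is exactly
// max_len bytes long.
std::string shorten_term(const std::string& term, std::size_t max_len) {
    const std::size_t hash_len = 8;
    if (max_len <= hash_len)
        throw std::invalid_argument("max_len must be greater than 8");
    if (term.size() <= max_len) return term;

    const std::size_t keep = max_len - hash_len;
    const std::uint32_t h = fnv1a(term.substr(keep));
    static const char digits[] = "0123456789abcdef";
    std::string out = term.substr(0, keep);
    for (int shift = 28; shift >= 0; shift -= 4) {
        out += digits[(h >> shift) & 0xf];
    }
    return out;
}
```

Long URLs which share a prefix but differ in the tail then usually map
to different terms (hash collisions are possible), and every term
stays within the chosen length.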

> (Still, no excuse for "Aborted" ... )

Indeed.  This case throws Xapian::InvalidArgumentError in C++ (I just
tested it to make sure).  It looks like the Perl bindings only actually
check for C++ exceptions when opening a Database or WritableDatabase,
so the exception probably isn't being handled by anything, which is why
we end up with just "Aborted".

Cheers,
    Olly