[Xapian-discuss] Is there a 64 character term size limit? In Ruby bindings?

Olly Betts olly at survex.com
Thu Jun 10 04:17:10 BST 2010


On Wed, Jun 09, 2010 at 04:19:33PM +0200, Henry C. wrote:
> On Tue, June 8, 2010 03:32, Olly Betts wrote:
> > On Mon, Jun 07, 2010 at 07:38:08PM +0100, Francis Irving wrote:
> >
> >> I've just found some items in my Xapian database which aren't being
> >> indexed, when the terms are quite long.
> >>
> >> Example term:
> >> Frotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust
> 
> I've run into this as well using the Perl bindings.  I found discussions
> regarding this with Omega where truncate/hash is used (correct me if this
> is not related).
> 
> My issue is that exceptions (i.e., "Exception: Key too long: length was...")

This is a different issue.  Francis was hitting a deliberate length limit
(64 bytes) on terms parsed from text.

You are hitting the Btree key size limit.  For flint and chert, this
translates to a term length limit of 245 bytes.

> are sometimes only thrown when indexing is complete and I flush/close.  I
> may be wrong about when it's thrown since I'm looking at log files.

If you are using Xapian >= 1.0.3 then the term limit should be checked when
you call add_document() or replace_document().  If you're getting an error
later then either your terms contain zero bytes (which currently need to
be escaped in the Btree keys) or there's a bug (in which case a test case
would be useful).

> Other times, the exception will occur followed by another: "Unexpected
> end of table when reading continuation of tag..." -- this is probably
> because of the unhandled initial exception.

An exception shouldn't cause problems like that.  Again, a testcase would
be useful.

> Anyway, when using Perl, how can I either truncate (to say, 239) or hash
> the key to prevent this error from occurring?  My real-world data can be
> quite dirty, so I need to gracefully handle this issue.

Good hashing is somewhat domain-specific, but you could just index the MD5
or SHA1 of the term.
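
As a rough sketch of the hashing approach (not a definitive recipe): the
helper name safe_term() below is made up for illustration, and it assumes
$term is already a byte string (for example UTF-8 encoded) rather than a
Perl character string.  It leaves terms that fit untouched and swaps
anything longer for the SHA1 hex digest of its bytes, using the 245 byte
figure for flint and chert and counting each zero byte twice.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Limit quoted in this thread for flint and chert.
my $MAX_KEY_BYTES = 245;

# Hypothetical helper: return the term unchanged if it fits in a Btree key,
# otherwise return the 40 character SHA1 hex digest of the term.
# Assumes $term is a byte string, so length() counts bytes.
sub safe_term {
    my ($term) = @_;
    # Zero bytes are escaped in the key, so each one counts twice.
    my $zero_bytes = ($term =~ tr/\0//);
    return $term if length($term) + $zero_bytes <= $MAX_KEY_BYTES;
    return sha1_hex($term);
}
```

The same safe_term() call has to be applied to terms at search time too,
otherwise a query for the original long term won't match the hashed one.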

Truncation is easy:

    $term = substr($term, 0, 239);

The actual limit is 245 bytes (except that zero bytes count twice) for
flint and chert.
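
If your data may contain zero bytes or multi-byte UTF-8 characters, a
plain substr() can still go over the limit or cut a character in half.
Here is a minimal sketch of a more careful truncation; truncate_term() is
a made-up name, and it assumes $term is a byte string (so substr() and
length() work in bytes, not characters).

```perl
use strict;
use warnings;

# Limit quoted in this thread for flint and chert.
my $MAX_KEY_BYTES = 245;

# Hypothetical helper: truncate a byte-string term so that it fits in a
# Btree key, counting each zero byte twice.
sub truncate_term {
    my ($term) = @_;
    my ($used, $keep) = (0, 0);
    for my $i (0 .. length($term) - 1) {
        my $cost = substr($term, $i, 1) eq "\0" ? 2 : 1;
        last if $used + $cost > $MAX_KEY_BYTES;
        $used += $cost;
        $keep = $i + 1;
    }
    my $out = substr($term, 0, $keep);
    # If the first dropped byte is a UTF-8 continuation byte, the cut fell
    # inside a multi-byte character, so drop the partial character as well.
    if ($keep < length($term)
        && (ord(substr($term, $keep, 1)) & 0xC0) == 0x80) {
        $out =~ s/[\xC0-\xFF][\x80-\xBF]*\z//;
    }
    return $out;
}
```

Bear in mind that truncation can make two different long terms identical,
which may or may not matter for your data.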

Cheers,
    Olly


