[Xapian-discuss] Is there a 64 character term size limit? In Ruby bindings?
Olly Betts
olly at survex.com
Thu Jun 10 04:17:10 BST 2010
On Wed, Jun 09, 2010 at 04:19:33PM +0200, Henry C. wrote:
> On Tue, June 8, 2010 03:32, Olly Betts wrote:
> > On Mon, Jun 07, 2010 at 07:38:08PM +0100, Francis Irving wrote:
> >
> >> I've just found some items in my Xapian database which aren't being
> >> indexed, when the terms are quite long.
> >>
> >> Example term:
> >> Frotherham_doncaster_and_south_humber_mental_health_nhs_foundation_trust
>
> I've run into this as well using the Perl bindings. I found discussions
> regarding this with Omega where truncate/hash is used (correct me if this
> is not related).
>
> My issue is that exceptions (ie, "Exception: Key too long: length was...")

This is a different issue. Francis was hitting a deliberate length limit
(64 bytes) on terms parsed from text.

You are hitting the Btree key size limit. For flint and chert, this
translates to a term length limit of 245 bytes.

> are sometimes only thrown when indexing is complete and I flush/close. I
> may be wrong about when it's thrown since I'm looking at log files.

If you are using Xapian >= 1.0.3 then the term limit should be checked when
you call add_document() or replace_document(). If you're getting an error
later then either your terms contain zero bytes (which currently need to
be escaped in the Btree keys) or there's a bug (in which case a testcase
would be useful).

> Other times, the exception will occur followed by another: "Unexpected
> end of table when reading continuation of tag..." -- this is probably
> because of the unhandled initial exception.

An exception shouldn't cause problems like that. Again, a testcase would
be useful.

> Anyway, when using Perl, how can I either truncate (to say, 239) or hash
> the key to prevent this error from occurring? My real-world data can be
> quite dirty, so I need to gracefully handle this issue.
Good hashing is somewhat domain-specific, but you could just index the MD5
or SHA1 of the term.
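
A minimal sketch of the hashing option, for illustration only. The helper
name hash_long_term and the 239-byte cutoff are assumptions, not anything
Xapian provides, and the code assumes $term is a decoded Perl character
string that contains no zero bytes. Short terms pass through unchanged so
they stay readable; only over-long terms are replaced by their MD5.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Assumed cutoff, chosen to sit below the 245-byte flint/chert limit.
my $MAX_TERM_BYTES = 239;

# Hypothetical helper: return the term unchanged if its UTF-8 encoding
# is short enough, otherwise return the hex MD5 of those bytes.
sub hash_long_term {
    my ($term) = @_;
    my $bytes = encode_utf8($term);    # measure bytes, not characters
    return $term if length($bytes) <= $MAX_TERM_BYTES;
    return md5_hex($bytes);            # always 32 hex characters
}
```

Whatever mapping is used at index time has to be applied to terms at
search time too, otherwise the hashed terms will never match.
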
Truncation is easy:

    $term = substr($term, 0, 239);

The actual limit is 245 bytes (except that zero bytes count twice) for
flint and chert.
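
If the terms may contain non-ASCII characters or zero bytes, substr() on a
character string counts characters rather than bytes, so a byte-aware
variant may be safer. The following is only a sketch: key_length and
truncate_term are hypothetical helper names, and it assumes $term is a
decoded Perl character string whose UTF-8 encoding is what ends up in the
database.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Term length limit quoted above for flint and chert.
my $KEY_LIMIT = 245;

# Hypothetical helper: length of a byte string as the limit counts it,
# with each zero byte counted twice.
sub key_length {
    my ($bytes) = @_;
    my $zeros = ($bytes =~ tr/\0//);    # tr with no replacement just counts
    return length($bytes) + $zeros;
}

# Hypothetical helper: shorten a character string until its UTF-8
# encoding fits within $KEY_LIMIT, never cutting a character in half.
sub truncate_term {
    my ($term) = @_;
    # Every character is at least one byte, so this is a safe first cut.
    $term = substr($term, 0, $KEY_LIMIT);
    # Drop whole characters until the encoded form fits.
    while (key_length(encode_utf8($term)) > $KEY_LIMIT) {
        chop $term;
    }
    return $term;
}
```

Comparing key_length(encode_utf8($term)) against the limit before calling
add_document() is also a cheap way to log or skip dirty terms instead of
silently shortening them.
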
Cheers,
Olly