Xapian 1.4.0 released

Kevin Duraj kevin.duraj at zefr.com
Mon Jul 25 21:48:02 BST 2016

Now imagine my situation, and probably that of others who work with
big data. I select 1 billion YouTube videos and index them with
Xapian. Now a kid uploads a Pokemon video and, for some reason, keeps
pressing a single key on the keyboard until the term becomes 500
characters long (e.g., EEEEEEE).

The Xapian indexer is running, and after it has indexed 500 million
documents it suddenly reaches the kid's Pokemon video with the
500-character term in the description. Xapian then stops the entire
indexing run, saying "Term too long > 245."

I think a warning in a log file, stating the document id and the term
that is too long, would be sufficient. Of course, I can fix it myself
by checking every term's length, but that adds more overhead to big
data computing.
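
For illustration, the per-term check mentioned above could be sketched
in Python as follows. This is only a sketch under assumptions:
`safe_terms` and `MAX_TERM_BYTES` are hypothetical names, not part of
Xapian, and the limit is assumed to be measured in bytes of the
UTF-8-encoded term rather than in characters.

```python
import logging

# Limit taken from the "Term too long > 245" message; assumed to be bytes.
MAX_TERM_BYTES = 245

log = logging.getLogger("indexer")


def safe_terms(doc_id, terms, max_bytes=MAX_TERM_BYTES):
    """Yield the terms that fit within max_bytes; log and skip the rest.

    Hypothetical helper: it only filters an iterable of strings and does
    not call Xapian itself.
    """
    for term in terms:
        size = len(term.encode("utf-8"))
        if size > max_bytes:
            # Record the document id and a short prefix of the offending term.
            log.warning("doc %s: skipping term of %d bytes: %r...",
                        doc_id, size, term[:20])
            continue
        yield term
```

With the Python bindings, the surviving terms could then be passed to
`doc.add_term(term)` one at a time. The extra cost per term is one
encode and one length comparison.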

On Sun, Jul 24, 2016 at 7:16 AM, James Aylett <james-xapian at tartarus.org> wrote:
> On Fri, Jul 22, 2016 at 07:19:43PM -0700, Kevin Duraj wrote:
>> I would like to propose changing the following code: instead of
>> crashing and aborting the entire index when a term is longer than
>> 245 characters, we could truncate the term to 245 characters and
>> continue indexing.
> Kevin -- I wonder what others are currently doing when this comes up
> (or if they're just ignoring it). Another approach, which I've
> mentioned on the PR, might be to auto-truncate terms earlier in the
> process, using a convenience function wrapped inside a call to
> `add_term()` and similar. This would allow people who find use for the
> exception to continue using things that way.
> Alternatively, maybe we could find a way of configuring this
> behaviour. I certainly see the benefit in some situations of being
> able to just fling data at an indexer and not worry over-much about
> long terms, which are mostly flotsam anyway in a lot of applications.
> Anyone else have any thoughts? Now is a good time to think about
> things like this.
> (I'm not a fan of silent truncation; it's bitten me on too many other
> EIS in the past. Choosing it deliberately is of course another matter.)
> J
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
