Xapian 1.4.0 released

Olly Betts olly at survex.com
Mon Sep 12 05:51:32 BST 2016


On Mon, Jul 25, 2016 at 01:48:02PM -0700, Kevin Duraj wrote:
> Now imagine my situation and probably others too, when we are working
> with big data. I select 1 billion of YouTube videos, and then I index
> it with Xapian. Now a kid uploads Pokemon video and for some reason,
> the kid keeps pressing a single key on the keyboard until the term
> become 500 characters long (e.g., EEEEEEE).
> 
> Xapian index is running and after it has indexed 500 million
> documents, suddenly come to the kid Pokemon video with 500 characters
> long term in the description and Xapian will stop the entire index,
> saying that "Term too long > 245."

No, TermGenerator will skip the term because it is longer than
max_word_length (which you can set through the API but defaults to 64
bytes).
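
To make the skipping behaviour concrete, here is a rough model of it in plain Python. This is only a sketch: it is not Xapian's tokeniser, the function name index_words is made up for illustration, and the only things taken from above are the 64-byte default and the fact that over-long words are skipped rather than raising an error.

```python
def index_words(text, max_word_length=64):
    """Return the words that would be kept, skipping over-long ones.

    Hypothetical model only: splits on whitespace and lowercases, which is
    much cruder than what TermGenerator really does.
    """
    kept = []
    for word in text.split():
        # The limit is measured in bytes, so compare the UTF-8 length.
        if len(word.encode("utf-8")) > max_word_length:
            continue  # silently skip, no exception
        kept.append(word.lower())
    return kept
```

With this model, a description containing a 500-character run of "E" simply loses that one word and the rest is indexed as usual.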

You'll only hit this exception if you set max_word_length much higher
than the default, if you call add_term() and/or add_posting() directly
instead of using TermGenerator, or if you call add_boolean_term().  So
with modern API use, you only need to check that boolean terms fit in
the length limit (and those are the cases where blindly truncating is
most problematic).

> I think, a log file with a warning would be sufficient stating the
> document id, the term that is too long. Of course, I can fix it by
> myself and check every terms length, but that will add more overhead
> to big data computing.

Xapian doesn't currently have a log file to write this to.

Cheers,
    Olly


