[Xapian-discuss] Limitation of the terms size

Thu Mar 26 14:52:10 GMT 2009

On Wed, 25 Mar 2009 03:18:32 +0000,
Olly Betts <olly at survex.com> wrote:

> On Tue, Mar 24, 2009 at 03:00:33PM +0000, James Aylett wrote:
> > On Tue, Mar 24, 2009 at 03:55:41PM +0100, David Versmisse wrote:
> > 
> > > A small question: I found in the source that a term is limited to
> > > 245 characters (#define MAX_SAFE_TERM_LENGTH 245). Do you plan to
> > > change this limit in future versions?
> 
> It may increase a little.  It's very unlikely to be removed
> completely - handling arbitrarily long keys is problematic in the
> B-tree code.
> 
> > > If not, how can I manage very long terms? For example, we store the
> > > "paths" of our objects in the database. These paths can be very
> > > long:
> > > "I/affaires-generales/ressources-humaines/formation/Concours/Acces-au-grade-de-technicien-superieur-principal-de-l'industrie-et-des-mines/Acces-au-grade-de-technicien-superieur-principal-de-l'industrie-et-des-mines/concours-TSPIM-septembre-2007.pdf"
> > > And it is very practical for us to index it.
> > 
> > Honestly, does that need to be indexed?
> 
> Indeed - you only need to generate a term for this if you want to be
> able to locate documents by it.  But if this path is the unique ID for
> the document, then you do need to be able to do that for updating it.
> 
> > One solution here is to do
> > what we do in omega with URIs, and use a reduced version (including
> > a hash of the complete one or the redacted information) for the
> > term if it's going over the length limit.
> 
> The approach omindex takes is to not care if two long URLs with the
> same first N characters have a hash collision, which beats refusing
> to index them, but isn't ideal (if two documents collide, we only
> index one of them).
> 
> You can also handle really long paths by splitting them over multiple
> terms, as I described here recently:
> 
> http://article.gmane.org/gmane.comp.search.xapian.general/7126
> 
> Cheers,
>     Olly

Thank you for your answers. Following your suggestions, we wrote this:

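from hashlib import sha1

def _reduce_data(data):
    # The term limit is in bytes, so measure the UTF-8 encoded form
    encoded = data if isinstance(data, bytes) else data.encode('utf-8')
    # If the data is too long, replace it with its SHA-1 hex digest
    if len(encoded) > 240:
        return sha1(encoded).hexdigest()
    # All OK, simply return the data unchanged
    return data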

This function is called both when indexing and when searching. It seems
to work. The problem is that the "path" is our ID for each document.
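
A possible variant, closer to the reduced-version-plus-hash idea suggested
above, keeps a readable prefix of the path and appends the hash of the
whole path. This is only a sketch: the name reduce_term, the 240-byte
budget and the choice of SHA-1 are our own assumptions, and it is not the
code omega actually uses.

```python
from hashlib import sha1

MAX_TERM_BYTES = 240  # assumed margin below MAX_SAFE_TERM_LENGTH (245)
HASH_HEX_LEN = 40     # a SHA-1 hex digest is always 40 characters


def reduce_term(path, max_bytes=MAX_TERM_BYTES):
    """Return path unchanged if it fits, else a truncated prefix + SHA-1.

    Hypothetical helper: `path` is expected to be a text string.
    """
    encoded = path.encode('utf-8')
    if len(encoded) <= max_bytes:
        return path
    # Hash the complete path so that different long paths stay distinct
    digest = sha1(encoded).hexdigest()
    # Keep the leading bytes that fit; 'ignore' drops a UTF-8 sequence
    # that the cut may have split in the middle
    prefix = encoded[:max_bytes - HASH_HEX_LEN].decode('utf-8', 'ignore')
    return prefix + digest
```

Two long paths that share the same first characters still get different
terms as long as their hashes differ, and the prefix keeps the term
readable when inspecting the database.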

Best regards,
David V.

-- 
David Versmisse
Itaapy <http://www.itaapy.com>         Tel +33 (0)1 42 23 67 45
9 rue Darwin, 75018 Paris              Fax +33 (0)1 53 28 27 88