[Xapian-discuss] Limitation of the terms size
Olly Betts
olly at survex.com
Wed Mar 25 03:18:32 GMT 2009
On Tue, Mar 24, 2009 at 03:00:33PM +0000, James Aylett wrote:
> On Tue, Mar 24, 2009 at 03:55:41PM +0100, David Versmisse wrote:
>
> > A small question: I found in the src that a term has a limitation of 245
> > characters (#define MAX_SAFE_TERM_LENGTH 245). Do you plan to change
> > this limitation in the future versions?
It may increase a little. It's very unlikely to be removed completely -
handling arbitrarily long keys is problematic in the B-tree code.
> > If not, how can I manage very big terms? For example, we store the
> > "paths" of your objects in the database. These paths can be very long:
> > "I/affaires-generales/ressources-humaines/formation/Concours/Acces-au-grade-de-technicien-superieur-principal-de-l'industrie-et-des-mines/Acces-au-grade-de-technicien-superieur-principal-de-l'industrie-et-des-mines/concours-TSPIM-septembre-2007.pdf"
> > And it is very practical for us to index it.
>
> Honestly, does that need to be indexed?
Indeed - you only need to generate a term for this if you want to be
able to locate documents by it. But if this path is the unique ID for
the document, then you do need to be able to do that for updating it.
> One solution here is to do
> what we do in omega with URIs, and use a reduced version (including a
> hash of the complete one or the redacted information) for the term if
> it's going over the length limit.
The approach omindex takes is to not care if two long URLs with the same
first N characters have a hash collision, which beats refusing to index
them, but isn't ideal (if two documents collide, we only index one of
them).
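To make the truncate-and-hash idea concrete, here is a minimal sketch in Python. It is not omindex's actual code: the "P" prefix, the choice of SHA-1, and treating the 245 limit as a byte count are all assumptions made for illustration. A longer digest makes a collision less likely but does not rule it out in principle.

```python
import hashlib

# Limit quoted in this thread (MAX_SAFE_TERM_LENGTH); treated as bytes here (assumption).
MAX_TERM_LEN = 245
# Hypothetical term prefix for path terms; pick whatever your schema uses.
PATH_PREFIX = b"P"


def path_term(path: str, max_len: int = MAX_TERM_LEN) -> bytes:
    """Return a term for `path` that never exceeds `max_len` bytes."""
    raw = PATH_PREFIX + path.encode("utf-8")
    if len(raw) <= max_len:
        # Short enough: use the path as-is.
        return raw
    # Too long: keep the head of the term and append a hash of the whole thing,
    # so two long paths sharing the same head still (almost always) differ.
    digest = hashlib.sha1(raw).hexdigest().encode("ascii")  # 40 bytes
    return raw[: max_len - len(digest)] + digest
```

The same function has to be used both when indexing and when looking a document up for an update, so that a given path always maps to the same term.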
You can also handle really long paths by splitting them over multiple
terms, as I described here recently:
http://article.gmane.org/gmane.comp.search.xapian.general/7126
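For illustration only, one possible way to split a long path over several terms is sketched below in Python. This is not necessarily the scheme from the post linked above; the "Q" prefix, the "index:" tag on each chunk, and the extra "#count" term are assumptions of this sketch.

```python
# Limit quoted in this thread (MAX_SAFE_TERM_LENGTH); treated as bytes here (assumption).
MAX_TERM_LEN = 245


def split_path_terms(path: str, max_len: int = MAX_TERM_LEN,
                     prefix: bytes = b"Q") -> list:
    """Split `path` into numbered chunk terms, each at most `max_len` bytes."""
    raw = path.encode("utf-8")
    # Reserve room for the prefix plus an index tag of up to 3 digits and a colon.
    chunk_size = max_len - len(prefix) - 4
    chunks = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)] or [b""]
    if len(chunks) > 999:
        raise ValueError("path too long for a 3-digit chunk index")
    # One term per chunk, tagged with its position in the path.
    terms = [prefix + b"%d:" % i + chunk for i, chunk in enumerate(chunks)]
    # A count term stops a longer path with the same leading chunks from matching.
    terms.append(prefix + b"#%d" % len(chunks))
    return terms
```

With this layout, a document would be located by a query that ANDs together every term returned for its path: all chunk terms plus the count term must be present for a match.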
Cheers,
Olly