[Xapian-discuss] Numbers format, anything special?

Olly Betts olly at survex.com
Thu Jan 24 02:57:55 GMT 2008


On Mon, Jan 14, 2008 at 11:37:31PM +0100, Yannick Warnier wrote:
> Is there anything special I should know about indexing numbers with
> Xapian (any "interpretation" of any kind)?

Xapian doesn't do much "interpretation" of terms - mostly it just treats
them as opaque blobs of data (the exceptions are the stemmers and the
use of term prefixes by TermGenerator/QueryParser).

Or perhaps I've misunderstood, and you're asking what a good strategy is
for indexing numbers?  That depends a lot on what your users will want
to be able to search for.  In some applications, you can get away
without indexing numbers at all.  Sometimes just the 4-digit numbers which
look like years are all that you need.
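
As a rough illustration of the "years only" strategy, here is a minimal
sketch in plain Python.  It makes no Xapian calls; the name
year_like_terms and the 1000-2999 range are assumptions made for the
example, not anything Xapian defines, and you would pass the results to
whatever indexing code you already have.

```python
import re

# Assumption: "looks like a year" means a standalone 4-digit number in the
# range 1000-2999, i.e. not part of a longer run of letters or digits.
_YEAR_RE = re.compile(r"(?<![0-9A-Za-z])[12][0-9]{3}(?![0-9A-Za-z])")


def year_like_terms(text):
    """Return the year-like 4-digit numbers found in text, in order."""
    return _YEAR_RE.findall(text)


if __name__ == "__main__":
    print(year_like_terms("Released in 2008, build 123456, ref 1999a"))
```

Longer digit strings such as "123456" and mixed tokens such as "1999a"
are skipped because of the look-around checks, so only numbers standing
on their own are kept.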

Sometimes numbers of any length, or even any string of letters and
digits, need to be indexed - e.g. telephone numbers, ISBNs, UK post
codes (which look like "W1A 4WW"), error codes, part numbers, package
tracking codes, memory addresses in error messages, GPG key fingerprints,
...

If you're indexing email or the web, it's useful to impose a length
limit on such terms to avoid indexing lines of uuencoded or base64
encoded data.
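
A length limit of that kind might be sketched like this in plain Python
(again no Xapian calls).  The tokenizer, the name indexable_terms and
the 32-character cut-off are all assumptions for illustration; the
right limit depends on your data.

```python
import re

# Assumed cut-off: tokens longer than this are treated as encoded noise.
MAX_TERM_LEN = 32

# Assumed tokenizer: unbroken runs of ASCII letters and digits.
_TOKEN_RE = re.compile(r"[0-9A-Za-z]+")


def indexable_terms(text, max_len=MAX_TERM_LEN):
    """Return lowercased alphanumeric tokens no longer than max_len."""
    return [
        tok.lower()
        for tok in _TOKEN_RE.findall(text)
        if len(tok) <= max_len
    ]


if __name__ == "__main__":
    line = "Error 0x7FFE12AB " + "QUJD" * 15
    print(indexable_terms(line))
```

This is only a heuristic: base64 data also contains "+" and "/", which
this tokenizer splits on, so some shorter fragments of encoded lines
will still get through.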

Cheers,
    Olly


