[Xapian-discuss] indexing

Sun Jan 14 14:15:52 GMT 2007

On Thu, Jan 11, 2007 at 05:02:36AM -0500, Jim wrote:
> Jamie D wrote:
> > 1. Am I right in thinking that to index a document Xapian must be
> > passed each word in a document separately along with the position the
> > word appears in the text?

> Not necessarily.  A call to QueryParser parses a complete document.

Um, QueryParser parses queries, not documents!

Assuming you want positional information, you do have to pass each word
separately with the position it occurs at.  If you aren't interested
in phrase searching, you only need to pass each distinct word once
(probably with the number of times it occurs).
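
To make the two cases concrete, here is a rough sketch in Python.  It
is an illustration only, not code from Xapian: the regex tokeniser, the
helper names and the RecordingDocument stand-in are all made up for
this example.  The stand-in only mimics the add_posting(term, position)
and add_term(term, wdf_increment) methods of Xapian's Document class so
the sketch runs without the bindings installed; with the real bindings
you would pass a xapian.Document instead.

```python
import re
from collections import Counter

# Hypothetical tokeniser: runs of word characters, lowercased.
# Real applications usually want something smarter (stemming, etc.).
WORD_RE = re.compile(r"\w+")


def tokenise(text):
    """Yield (term, position) pairs; positions start at 1."""
    for pos, match in enumerate(WORD_RE.finditer(text), start=1):
        yield match.group().lower(), pos


def index_text(doc, text, positional=True):
    """Feed the words of `text` to `doc` one call at a time."""
    if positional:
        # One call per occurrence, carrying the word position, so
        # that phrase searches can work later.
        for term, pos in tokenise(text):
            doc.add_posting(term, pos)
    else:
        # No positions wanted: one call per distinct word, passing
        # the number of times it occurs as the wdf increment.
        counts = Counter(term for term, _ in tokenise(text))
        for term, wdf in counts.items():
            doc.add_term(term, wdf)


class RecordingDocument:
    """Stand-in (an assumption of this sketch) that records calls."""

    def __init__(self):
        self.postings = []
        self.terms = {}

    def add_posting(self, term, pos, wdfinc=1):
        self.postings.append((term, pos))

    def add_term(self, term, wdfinc=1):
        self.terms[term] = self.terms.get(term, 0) + wdfinc
```

Note that every word costs a separate call on the document object in
the positional case, which is the per-word overhead in question when
going through the bindings.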

A built-in mechanism is on the todo list.  It would save everyone
from having to implement word tokenisation themselves, and it would
be faster when using the bindings: you'd only have to call into C++
once to index a block of text rather than once per word, which means
a lot less argument conversion and checking.

> > 3. Lastly, is it possible to index on one machine, then copy the
> > database files to another machine and search them without any issues?

As Richard said, databases are intended to be completely
architecture-independent, but there's a bug in flint which means this
isn't completely true for flint databases in released versions.

The bug is in the encoding of position lists, where an architecture (and
possibly compiler) dependent list of values gets encoded and decoded
using an extra bit.  This list includes commonly encountered values on
x86, but probably not on other architectures (e.g. on x86_64 only 2
enormous values are affected, and those are unlikely to exist in real
data).  Sadly x86 is probably the most common architecture of course...

More details here:

http://thread.gmane.org/gmane.comp.search.xapian.general/3542/focus=3613

Cheers,
    Olly