[Xapian-discuss] Xapian and indexing text with layout

Sun Apr 30 12:57:41 BST 2006

On Sun, Apr 30, 2006 at 12:43:08PM +0100, Lionel wrote:
> If I look for a term, it returns it OK, but each paragraph is returned as a
> separate document ID. That's not good, because in my application's
> logic each PDF is a single document in itself.

That's only how simpleindex works - there's no requirement in Xapian
to produce a Document for each paragraph; you choose what you want to
make a "Document".  simpleindex is intended as a simple example
showing how you might use the Xapian API, so we want to keep down the
amount of code which identifies documents, etc.

Perhaps it would be clearer as an indexer which takes a list of
filenames on the command line - that would probably have a similar
amount of non-Xapian-related code and should more closely match what
many users are trying to do.

> If I remove part of the code so that paragraphs are ignored, will that
> cause any problem with the indexing?

That's fine, or you might find it easier to write your code from a clean
start, just using simpleindex to see how to call Xapian (that's really
how it's intended to be useful).

So you want to start a fresh Xapian.Document for each PDF file, and only
call add_document (a method of Xapian.WritableDatabase) when you've
finished handling that PDF file.
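
As a rough illustration of that one-Document-per-file pattern, here is a
minimal sketch using the Python bindings.  It is not how simpleindex
itself is written, and it makes several assumptions: the text has already
been extracted from each PDF by some other tool, the names
terms_with_positions and index_files are made up for this example, and
the tokenising is deliberately naive (lowercased word characters, no
stemming).  The commit() call follows current releases of the bindings;
older releases used flush() instead.

```python
import re


def terms_with_positions(text):
    """Return (term, position) pairs for the whole text.

    Positions start at 1 and keep counting across paragraph breaks,
    so the file is treated as one continuous document.
    """
    words = re.findall(r"\w+", text.lower())
    return [(word, pos) for pos, word in enumerate(words, start=1)]


def index_files(db_path, texts_by_file):
    """Add exactly one Xapian document per file.

    texts_by_file is a hypothetical mapping of filename to the text
    already extracted from that file.
    """
    # Imported here so the helper above can be used without the bindings.
    import xapian

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    for filename, text in texts_by_file.items():
        doc = xapian.Document()        # fresh Document for this file
        doc.set_data(filename)         # stored so a match can be traced back
        for term, pos in terms_with_positions(text):
            doc.add_posting(term, pos)
        db.add_document(doc)           # called once, after the whole file
    db.commit()
```

Because add_document is only called after the inner loop finishes, a
search for a term returns one match per file however many paragraphs
the file contains.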

Cheers,
    Olly