[Xapian-discuss] xapian doc_id & duplicate documents

Olly Betts olly at survex.com
Tue Oct 18 00:28:55 BST 2005


On Mon, Oct 17, 2005 at 09:26:47PM +0000, Salem Berhanu wrote:
> I want to make sure I don't index a document more than once and so wanted 
> to put a check before adding a document to the database. It doesn't look 
> like I can choose my own doc_id so I was wondering how I can check if a 
> document is already indexed before adding it to the database. Had I been 
> allowed to set the doc_id I could easily query it.

Provided you're using Xapian 0.8.2 or later you can specify the docid to
use - just call Xapian::WritableDatabase::replace_document() with the
docid you want to use.

Hmm, looks like the Doxygen-generated API documentation wasn't updated
to mention this.  I'll attend to that.

Note that it's probably unwise to do this if the "foreign" docids are
sparse, because you'll undermine the postlist compression in the
backend.  The odd gap isn't an issue, but using a 32-bit hash of a
string as the docid is rather unwise.  If you're somewhere in between,
try both and let us know which works best for your situation!
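
To get a feel for why sparse docids hurt, here is a toy model that
stores an ascending docid list as gaps from the previous docid, each gap
written as a variable-length integer with 7 payload bits per byte.  This
is only a generic delta-plus-varint illustration, not Xapian's actual
on-disk format, and the function names are invented for the example:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store `value` as a base-128 variable-length integer
// (7 payload bits per byte).
std::size_t varint_size(std::uint32_t value) {
    std::size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Total bytes needed to store an ascending docid list as varint-encoded
// gaps from the previous docid (the first gap is measured from 0).
std::size_t delta_encoded_size(const std::vector<std::uint32_t>& sorted_docids) {
    std::size_t total = 0;
    std::uint32_t previous = 0;
    for (std::uint32_t did : sorted_docids) {
        total += varint_size(did - previous);
        previous = did;
    }
    return total;
}
```

In this model 1000 consecutive docids (1 to 1000) cost 1000 bytes, while
1000 docids spaced 4,000,000 apart cost 4000 bytes, since every gap then
needs four bytes instead of one.  Hash-derived docids behave like the
second case.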

Note that indexing is also faster if new document ids appear in
ascending order, so if that's easy to arrange, it's worthwhile doing
so for a large system.

If you have sparse numeric ids, it's probably best to use the same
technique you would for non-numeric unique ids - i.e. add a uid term for
each document (Q-prefix is standard for this) and use the version of
Xapian::WritableDatabase::replace_document() which takes a term instead
of a docid.  There's a matching version of
Xapian::WritableDatabase::delete_document() too.

Cheers,
    Olly


