[Xapian-discuss] xapian doc_id & duplicate documents

Salem Berhanu salemb4 at hotmail.com
Tue Oct 18 18:26:40 BST 2005


"Provided you're using Xapian 0.8.2 or later you can specify the docid to 
use - just call Xapian::WritableDatabase::replace_document() with the docid 
you want to use"

I am, but the problem is: how do I know the existing docid of a document
that is the same as the one I am about to add? The only way I can think of
right now is to always index a unique identifier term for each document and
do a replace_document by term when adding. (By "unique identifier" I mean
unique according to me rather than according to Xapian, for instance the
location of the document.)

Is there a direct way of replacing the docid right before or after adding to
the database? If not, where would be a good place to store my unique
identifier for a document? This is also important when querying, since I need
to link the doc_id to my unique identifier.
Thanks
S
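
A minimal sketch of the "unique identifier term" approach described above,
for illustration only. The helper names uid_term and add_or_replace are
hypothetical and not part of Xapian. The code is a template so it stands on
its own: Database is assumed to be Xapian::WritableDatabase and Document to
be Xapian::Document, and the only calls it relies on are
Document::add_term() and the replace_document() overload that takes a term.

```cpp
#include <cassert>
#include <string>

// Build the unique-id term: the "Q" prefix followed by the application's
// own identifier (here, the document's location).
inline std::string uid_term(const std::string& location) {
    assert(!location.empty());  // an empty identifier cannot be unique
    return "Q" + location;
}

// Add `doc` to `db`, replacing any document already indexed under the same
// unique-id term. Assumption: Database is Xapian::WritableDatabase and
// Document is Xapian::Document. The result is whatever replace_document()
// returns (the docid in current Xapian releases).
template <class Database, class Document>
auto add_or_replace(Database& db, Document doc, const std::string& location)
    -> decltype(db.replace_document(location, doc)) {
    const std::string term = uid_term(location);
    doc.add_term(term);  // index the uid so a later call finds this document
    return db.replace_document(term, doc);
}
```

Calling add_or_replace() twice with the same location should leave a single
document in the database rather than two. To map a match back to the
identifier when querying, one option is to also keep the same string in the
document data (Xapian::Document::set_data()) or in a value slot.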

>From: Olly Betts <olly at survex.com>
>To: Salem Berhanu <salemb4 at hotmail.com>
>CC: xapian-discuss at lists.xapian.org
>Subject: Re: [Xapian-discuss] xapian doc_id & duplicate documents
>Date: Tue, 18 Oct 2005 00:28:55 +0100
>
>On Mon, Oct 17, 2005 at 09:26:47PM +0000, Salem Berhanu wrote:
> > I want to make sure I don't index a document more than once, and so
> > wanted to put a check before adding a document to the database. It
> > doesn't look like I can choose my own doc_id, so I was wondering how I
> > can check if a document is already indexed before adding it to the
> > database. Had I been allowed to set the doc_id, I could easily query it.
>
>Provided you're using Xapian 0.8.2 or later you can specify the docid to
>use - just call Xapian::WritableDatabase::replace_document() with the
>docid you want to use.
>
>Hmm, looks like the doxygen-generated API documentation wasn't updated
>to mention this.  I'll attend to that.
>
>Note that it's probably unwise to do this if the "foreign" docids are
>sparse, because you'll undermine the postlist compression in the
>backend.  The odd gap isn't an issue, but using a 32-bit hash of a
>string as the docid is rather unwise.  If you're somewhere in between,
>try both and let us know which works best for your situation!
>
>Note that indexing is also faster if new document ids appear in
>ascending order, so if that's easy to arrange, it's worthwhile doing
>so for a large system.
>
>If you have sparse numeric ids, it's probably best to use the same
>technique you would for non-numeric unique ids - i.e. add a uid term for
>each document (Q-prefix is standard for this) and use the version of
>Xapian::WritableDatabase::replace_document() which takes a term instead
>of a docid.  There's a matching version of
>Xapian::WritableDatabase::delete_document() too.
>
>Cheers,
>     Olly
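
To illustrate the point about sparse docids and postlist compression, here
is a rough model, assuming a scheme that stores the gap between consecutive
docids as a variable-length integer with 7 payload bits per byte. This is an
illustration of the general idea only, not Xapian's actual on-disk format,
and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store `value` as a variable-length integer that carries
// 7 payload bits per byte.
std::size_t varint_size(std::uint32_t value) {
    std::size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Total bytes for a posting list stored as gaps between consecutive docids.
// `docids` must be strictly ascending and start at 1 or higher.
std::size_t delta_encoded_size(const std::vector<std::uint32_t>& docids) {
    std::size_t total = 0;
    std::uint32_t previous = 0;
    for (std::uint32_t id : docids) {
        assert(id > previous);
        total += varint_size(id - previous);  // small gaps need fewer bytes
        previous = id;
    }
    return total;
}
```

In this model the five consecutive docids 1 to 5 cost 5 bytes in total,
while the three widely spaced docids 1000, 2000000 and 3000000000 cost 10.
Hash-derived docids behave like the second case for every term.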
