[Xapian-discuss] xapian doc_id & duplicate documents
Salem Berhanu
salemb4 at hotmail.com
Tue Oct 18 18:26:40 BST 2005
"Provided you're using Xapian 0.8.2 or later you can specify the docid to
use - just call Xapian::WritableDatabase::replace_document() with the docid
you want to use"
I am, but the problem is: how do I know the existing docid of a document
that is the same as the one I am about to add? The only way I can think of
right now is to always index a unique identifier term for each document and
do a replace_document by term when adding. (By "unique identifier" I mean
one that I define rather than one Xapian assigns, for instance the location
of the document.)
Is there a direct way of replacing the docid right before or after adding to
the database? If not, where would be a good place to store my unique
identifier for a document? This also matters when querying, since I need
to map the doc_id back to my unique identifier.
Thanks
S
>From: Olly Betts <olly at survex.com>
>To: Salem Berhanu <salemb4 at hotmail.com>
>CC: xapian-discuss at lists.xapian.org
>Subject: Re: [Xapian-discuss] xapian doc_id & duplicate documents
>Date: Tue, 18 Oct 2005 00:28:55 +0100
>
>On Mon, Oct 17, 2005 at 09:26:47PM +0000, Salem Berhanu wrote:
> > I want to make sure I don't index a document more than once, so I
> > wanted to put a check in before adding a document to the database. It
> > doesn't look like I can choose my own doc_id, so I was wondering how I
> > can check whether a document is already indexed before adding it. Had I
> > been allowed to set the doc_id, I could easily query it.
>
>Provided you're using Xapian 0.8.2 or later you can specify the docid to
>use - just call Xapian::WritableDatabase::replace_document() with the
>docid you want to use.
>
>Hmm, looks like the doxygen-generated API documentation wasn't updated
>to mention this. I'll attend to that.
>
>Note that it's probably unwise to do this if the "foreign" docids are
>sparse, because you'll undermine the postlist compression in the
>backend. The odd gap isn't an issue, but using a 32-bit hash of a
>string as the docid is rather unwise. If you're somewhere in between,
>try both and let us know which works best for your situation!
>
>Note that indexing is also faster if new document ids appear in
>ascending order, so if that's easy to arrange, it's worthwhile doing
>so for a large system.
>
>If you have sparse numeric ids, it's probably best to use the same
>technique you would for non-numeric unique ids - i.e. add a uid term for
>each document (Q-prefix is standard for this) and use the version of
>Xapian::WritableDatabase::replace_document() which takes a term instead
>of a docid. There's a matching version of
>Xapian::WritableDatabase::delete_document() too.
>
>Cheers,
> Olly
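
The warning about sparse docids can be illustrated with a toy model. The
sketch below assumes a posting list stored as the gaps between consecutive
docids, each gap written as a variable-length integer with 7 payload bits
per byte. This is a simplification for illustration only, not Xapian's
actual on-disk format, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store `value` as a base-128 variable-length integer
// (7 payload bits per byte).
std::size_t varint_size(std::uint32_t value) {
    std::size_t bytes = 1;
    while (value >= 128) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Total bytes needed to store an ascending list of docids as the gaps
// between neighbouring entries (the first gap is measured from 0).
std::size_t delta_encoded_size(const std::vector<std::uint32_t>& sorted_docids) {
    std::size_t total = 0;
    std::uint32_t previous = 0;
    for (std::uint32_t id : sorted_docids) {
        total += varint_size(id - previous);
        previous = id;
    }
    return total;
}
```

In this model, four consecutive docids {1, 2, 3, 4} cost one byte each,
while three hash-like docids {1000000, 900000000, 3000000000} cost 3 + 5 + 5
bytes. The occasional gap barely changes the total, but docids scattered
across the whole 32-bit range make nearly every entry several times larger.
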