[Xapian-discuss] Updating existing documents

Luis Zarrabeitia kyrie at uh.cu
Mon May 18 17:18:27 BST 2009


On Sunday 17 May 2009 10:40:11 am Richard Boulton wrote:
> 2009/5/17 Ivo Jansch - Ibuildings <ivo at ibuildings.nl>:
> > hi,
> >
> > this may be a very newbie question, but I couldn't find it in the docs;
> > I've built my spider and run it a few times, but every document is now
> > in the index multiple times. While there is a get_id on Documents, I can
> > see no set_id; how do I tell it that it's indexing an existing doc when
> > I index a document?
>
> Use replace_document() instead of add_document() - this allows you to
> specify the id.

Another related newbie question:
  How can you get the ID of the document to replace, given the new
  document?

The OP's problem seems to be that when he crawls a second time, he indexes 
the same documents again (and thus they appear twice in the database). So 
the second time he finds a document, he must know the ID that was assigned 
to it the first time. (I recently had a similar situation[1], where I was 
trying to use document titles; this case seems to be the same, only with 
URLs instead of titles.) Should the OP (or I) keep an external URL->doc_id 
mapping (and be careful with xapian-compact), or is there a better way?
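
To make the external-mapping idea concrete, here is a minimal sketch of what
I have in mind. It assumes only that the database object has an
add_document(doc) that returns the new ID and a replace_document(doc_id, doc),
as xapian.WritableDatabase does. The names UrlIndex and FakeDatabase are
hypothetical; FakeDatabase is just an in-memory stand-in so the sketch runs
without Xapian installed.

```python
class UrlIndex:
    """Hypothetical helper: keeps a URL -> doc_id mapping next to a database."""

    def __init__(self, db, url_to_id=None):
        self.db = db
        # In a real crawler this mapping would have to be saved between runs.
        self.url_to_id = {} if url_to_id is None else url_to_id

    def index(self, url, doc):
        """Add doc the first time a URL is seen, replace it afterwards."""
        doc_id = self.url_to_id.get(url)
        if doc_id is None:
            # First crawl: let the database assign an ID and remember it.
            doc_id = self.db.add_document(doc)
            self.url_to_id[url] = doc_id
        else:
            # Later crawls: overwrite the document stored under the known ID.
            self.db.replace_document(doc_id, doc)
        return doc_id


class FakeDatabase:
    """In-memory stand-in (an assumption, not Xapian) with the two methods used."""

    def __init__(self):
        self.docs = {}
        self.last_id = 0

    def add_document(self, doc):
        self.last_id += 1
        self.docs[self.last_id] = doc
        return self.last_id

    def replace_document(self, doc_id, doc):
        self.docs[doc_id] = doc
```

With this, indexing the same URL twice leaves a single entry in the database
instead of two. The weak point is the one I mentioned: the mapping lives
outside the database, so it goes stale if the IDs ever change underneath it.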


[1] http://lists.xapian.org/pipermail/xapian-discuss/2009-May/006679.html

-- 
Luis Zarrabeitia (aka Kyrie)
Fac. de Matemática y Computación, UH.
http://profesores.matcom.uh.cu/~kyrie


