[Xapian-discuss] Incremental indexing
Olly Betts
olly at survex.com
Wed Mar 21 01:57:31 GMT 2012
On Tue, Mar 20, 2012 at 01:01:48PM -0400, Marios Titas wrote:
> On Tue, Mar 20, 2012 at 09:34, Olly Betts <olly at survex.com> wrote:
> > Yes it does pretty well, assuming you're using Xapian 1.2 or later
> > 1.0.x. This change made a huge difference to the speed of adding and
> > removing tags from indexed email in notmuch.
>
> In order for me to understand how this works, consider the following scenario:
> 1. I retrieve a document with db.get_document. Assume that it has n terms.
> 2. I add k terms with doc.add_term or doc.add_posting.
> 3. I write the document back to the database with db.replace_document.
> So are you saying that the above procedure takes O(k) time? What about
> memory, is it O(k) too? Or does xapian load the entire document in
> memory (in which case it would be O(n+k) for both time and memory)?
The list of terms indexing the document needs updating, so the operation
has complexity O(n), but that's misleading as in practice it's the time
to update the posting lists which dominates, and we only update those
where there are changes. It's really something more like O(n + K.k)
where K is fairly large.
> There is also another problem with this approach, it is not easy to
> remove terms from a document. It would be nice if the document class
> had method to decrement the wdf of a term and if the new wdf is zero
> then remove it altogether.
There's not a single method to do this, but if you are indexing with
positional information then Document::remove_posting() can decrease the
wdf of a term and Document::remove_term() can remove a term completely:
http://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#11eec629737e496a7077b87f763967a5
So you could check if the wdf goes to zero and call remove_term() if it
does.
If you're not storing positions, you can actually call add_term() with
a negative value (which gets cast to unsigned) to reduce the wdf, though
that's not documented. Again you have to remove the term yourself though.
I think it would make sense for remove_posting() to also remove the term
if the wdf goes to zero and there are no positions left, and to provide
a way to reduce the wdf without removing positions, which also removes
the term if the wdf goes to zero (and there are no positions too,
probably).
Cheers,
Olly
More information about the Xapian-discuss
mailing list