[Xapian-devel] Re: writabledatabase_delete_document()

Olly Betts olly at survex.com
Mon Dec 4 18:53:15 GMT 2006


On Fri, Dec 01, 2006 at 12:44:37PM -0800, Alexander Lind wrote:
> When I try to remove them now (using writabledatabase_delete_document()
> via php), it halfway freezes up the machine, and the apache httpd runs
> amok spawning more and more children, until I break the php script that
> is trying to remove documents from xapian.

I take it you're trying to do this from a PHP script run through Apache?

Flushing updates to a large database can take a while - it's easy to
update 4 of the 5 tables, but for the postlist table we need to amend
an entry for each term which indexed the document, which is potentially
a lot of entries widely spread through the table.

Note that the cost here scales sublinearly - i.e. deleting 100 documents
and flushing them in one go is a lot less than 100 times as expensive as
deleting and flushing one document at a time.  Perhaps that's something
you can bear in mind when designing how deletions are handled?
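
To illustrate, here's a minimal PHP sketch of batching a set of deletions
before a single flush.  The flush function name is my assumption based on
the procedural bindings, so check the exact spelling against your version:

<?php
// $db is assumed to be an already-open writable Xapian database handle,
// and $docids_to_delete an array of document ids pending deletion.
foreach ($docids_to_delete as $docid) {
    writabledatabase_delete_document($db, intval($docid));
}
// A single flush pays the postlist-table cost once for the whole batch.
// (writabledatabase_flush() is an assumed name here - check the exact
// spelling in your version of the PHP bindings.)
writabledatabase_flush($db);
?>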

> From my laymens point of view, it seems that the xapian delete document
> function freezes up the OS on a filesystem level. Is this a correct
> assessment?

No.  My best guess is you're probably either suffering from a lot of
swapping, or just disk I/O overload.  Or there may be issues to do with
this running inside Apache too.

Calling WritableDatabase::delete_document() doesn't do much work at all.
The postlist table updates are saved up for when you call
WritableDatabase::flush() (either explicitly, or implicitly as happens
when you close a database, or after 1000 updates).  Some memory is
required to buffer up these changes, but I doubt that's causing you
to swap unless you're really tight on memory.

A flush can require a lot of I/O, so if the other tasks Apache is doing
are also I/O bound, you could cause them to run more slowly such that
more try to run at once and things get slower.  But this is not really Xapian's
fault - the server is just overloaded.  You either need to add more RAM
to improve caching and reduce I/O, beef up the disk subsystem so I/O
requests complete sooner, or move work off to another server.

As for running this inside Apache, if I were designing something like
this, I'd lean towards having the web interface lodge a request with an
index management process which does the real work behind the scenes
(this could be as simple as saving a file in a directory saying to
delete document N, and having the index management process scan the
directory for new files).  Then the index management process can batch
up multiple requests, and different users can make requests at the same
time without locking issues.
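
As a rough sketch of that "file per request" idea (the paths, the file
naming scheme, and every function name other than delete_document are
just invented for illustration):

<?php
// Web side: lodge a deletion request by dropping a file named after the
// docid into a spool directory.
function lodge_delete_request($docid) {
    $spool = "/var/spool/myindex/delete";
    file_put_contents($spool . "/" . intval($docid), "");
}

// Index management process: scan the spool, batch up the deletions, and
// flush once.  (writabledatabase_flush() is an assumed function name.)
function process_delete_requests($db) {
    $spool = "/var/spool/myindex/delete";
    $deleted = 0;
    foreach (scandir($spool) as $entry) {
        if ($entry == "." || $entry == "..") continue;
        writabledatabase_delete_document($db, intval($entry));
        unlink($spool . "/" . $entry);
        ++$deleted;
    }
    if ($deleted > 0) writabledatabase_flush($db);
}
?>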

> I just read a post somewhere on the net about how you can use term names
> to ID and delete items out of an xapian index instead of using a
> document_id. Is this faster?  (would seem strange if it was).

Using a term will be slower.  Currently the internal implementation is
to just open the postlist for the term and then call delete_document for
each docid, but there's scope for optimising this in the future.  Even
with that optimised, though, I'd expect deleting by docid to be faster.

Hmm, if you're calling from PHP, make sure that you're passing a PHP
integer to delete_document, not a PHP string.  If you pass a string
you'll be calling the "delete_document by term" variant.  If you
want to make sure, you can force a value to be an integer by using:
intval($docid)
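
For example (assuming $db is an open writable database handle; the term
in the second call is just a made-up example):

<?php
// Deletes by document id (the faster variant):
writabledatabase_delete_document($db, intval($docid));

// Deletes every document indexed by the given term (the slower
// "delete by term" variant):
writabledatabase_delete_document($db, "Qunique-id-42");
?>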

> Is the answer to my question here to split the data into multiple databases?

It might help, since you say this worked with smaller databases, but I'd
try moving other work off the server first.  Your database isn't that
enormous.

> Technically I know how to do it, but not logically. How many databases
> should I aim for - ie, should I aim for them not to be over a certain
> size, contain a certain amount of documents, or something else?

It depends on the spec of the server really.  When I'm building gmane's
index from scratch, I build a number of databases of 1 million documents
each and then merge them at the end, as I found that was fastest (though
I didn't do extensive trials).
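
If you go that route, the indexing loop could look something like this
sketch; the open and add calls are assumed names, so treat it as
pseudo-PHP rather than something to paste in:

<?php
// Sketch: index into databases of a million documents each, switching to
// a fresh database whenever the limit is hit.  open_writable_db() is a
// placeholder for however you create/open a writable database in your
// bindings, and writabledatabase_add_document() is an assumed name
// following the same pattern as the delete call.
$batch_size = 1000000;
$count = 0;
$dbnum = 0;
$db = open_writable_db("/data/index/part" . $dbnum);
foreach ($documents as $doc) {
    if ($count > 0 && $count % $batch_size == 0) {
        ++$dbnum;
        $db = open_writable_db("/data/index/part" . $dbnum);
    }
    writabledatabase_add_document($db, $doc);
    ++$count;
}
// At the end, merge /data/index/part* into one database (with whatever
// compaction/merging tool matches your backend), or just search them
// together as multiple databases.
?>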

> Should I distribute documents sequentially into them, or randomly, or
> use some other scheme?

If you ever want to search a particular subset, you could put that in
its own database (or databases).  And if you need to remove old
documents periodically (e.g. anything over a year old) doing it by
date makes sense.  Otherwise it's probably rather arbitrary.

Cheers,
    Olly


