[Xapian-devel] Re: writabledatabase_delete_document()

Alexander Lind malte at webstay.org
Mon Dec 4 19:54:54 GMT 2006


> I take it you're trying to do this from a PHP script run through Apache?
>   
It's via a PHP CLI script.
> Flushing updates to a large database can take a while - it's easy to
> update 4 of the 5 tables, but for the postlist table we need to amend
> an entry for each term which indexed the document, which is potentially
> a lot of entries widely spread through the table.
>
> Note that the cost here scales sublinearly - i.e. it's a lot less than 100
> times as expensive to delete 100 documents and flush them in one go than
> to delete and flush one document at a time.  Perhaps that's something
> you can bear in mind when designing how deletions are handled?
>   
Definitely - I'm not calling flush anywhere explicitly, just letting a script
do all the updates and deletes in a row and letting the library flush on its
own. Should I make it so all deletes are done by themselves, i.e. without
updates interwoven?
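
For reference, the kind of batching Olly describes would look roughly like
this with the object-oriented PHP5 bindings (just a sketch - the path and the
$docids_to_delete array are placeholders, and the older procedural PHP4
bindings spell the calls differently):

    <?php
    include "xapian.php";

    // Open the index for writing (path is a placeholder).
    $db = new XapianWritableDatabase("/path/to/index", Xapian::DB_CREATE_OR_OPEN);

    // Do a whole batch of deletes without flushing in between...
    foreach ($docids_to_delete as $docid) {
        $db->delete_document(intval($docid));   // int => delete by docid
    }

    // ...then let one flush write out all the postlist changes in one go.
    $db->flush();
    ?>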
>   
>> From my layman's point of view, it seems that the Xapian delete document
>> function freezes up the OS on a filesystem level. Is this a correct
>> assessment?
>>     
>
> No.  My best guess is you're probably either suffering from a lot of
> swapping, or just disk I/O overload.  Or there may be issues to do with
> this running inside Apache too.
>   
Apache is not involved here, but swapping and disk I/O are both likely
culprits here, especially since the machine is quite busy serving a
pretty large site at the same time.
> Calling WritableDatabase::delete_document() doesn't do much work at all.
> The postlist table updates are saved up for when you call
> WritableDatabase::flush() (either explicitly, or implicitly as happens
> when you close a database, or after 1000 updates).  Some memory is
> required to buffer up these changes, but I doubt that's causing you
> to swap unless you're really tight on memory.
>
> A flush can require a lot of I/O, so if the other tasks Apache is doing are
> also I/O bound, you could cause these to run more slowly such that more
> try to run at once and things get slow.  But this is not really Xapian's
> fault - the server is just overloaded.  You either need to add more RAM
> to improve caching and reduce I/O, beef up the disk subsystem so I/O
> requests complete sooner, or move work off to another server.
>
>   
Yep, moving to a different machine that can be dedicated to this is
probably my best shot.
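
(Side note on the implicit flush Olly mentions above: I'm assuming here that
the installed Xapian honours the XAPIAN_FLUSH_THRESHOLD environment variable -
if it does, the implicit-flush batch size could be raised from a CLI script
before the database is opened, something like:)

    <?php
    // Assumption: the installed Xapian reads XAPIAN_FLUSH_THRESHOLD.
    // Raise the implicit-flush threshold so more changes are batched per flush.
    putenv("XAPIAN_FLUSH_THRESHOLD=10000");

    include "xapian.php";
    $db = new XapianWritableDatabase("/path/to/index", Xapian::DB_CREATE_OR_OPEN);
    ?>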
> As for running this inside Apache, if I were designing something like
> this, I'd lean towards having the web interface lodge a request with an
> index management process which does the real work behind the scenes
> (this could be as simple as saving a file in a directory saying to
> delete document N, and having the index management process scan the
> directory for new files).  Then the index management process can batch
> up multiple requests, and different users can make requests at the same
> time without locking issues.
>   
It's free-standing PHP CLI scripts doing this behind-the-scenes job for
me, and they also use a bunch of tables in a MySQL DB to know what they
should do - add, update or delete documents in the Xapian index.
I know the MySQL server (other machine) is not the bottleneck, but
rather it must be that the server I run the scripts on is simply
overloaded, just as you say.
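
To give an idea, the driving loop is roughly like this (a simplified sketch
rather than my actual code - the queue table, the column names and the
build_document() helper are all made up):

    <?php
    include "xapian.php";

    // Connect to the MySQL queue and open the Xapian index for writing
    // (hosts, credentials and paths are placeholders).
    $mysql = mysqli_connect("dbhost", "user", "pass", "site");
    $xdb = new XapianWritableDatabase("/path/to/index", Xapian::DB_CREATE_OR_OPEN);

    $jobs = mysqli_query($mysql, "SELECT id, action, xapian_docid, item_id FROM index_queue ORDER BY id");
    while ($job = mysqli_fetch_assoc($jobs)) {
        if ($job['action'] == 'delete') {
            // Delete by docid (an integer, not a string!).
            $xdb->delete_document(intval($job['xapian_docid']));
        } else {
            // Add or update: build the document and replace it, keyed on a
            // unique ID term so the same code handles both cases.
            $doc = build_document($job['item_id']);   // hypothetical helper
            $xdb->replace_document("Q" . $job['item_id'], $doc);
        }
        mysqli_query($mysql, "DELETE FROM index_queue WHERE id = " . intval($job['id']));
    }

    // One flush at the end instead of one per job.
    $xdb->flush();
    ?>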
In fact I know the server load is often the issue, as when I kill other
resource-hogging scripts the Xapian indexer speeds up. However it didn't
help at all when I was trying to delete documents the other day.
But that of course was because I was passing the docid as a string, not
an int. Smart, eh? :p

>> I just read a post somewhere on the net about how you can use term names
>> to ID and delete items out of a Xapian index instead of using a
>> document_id. Is this faster?  (would seem strange if it was).
>>     
>
> Using a term will be slower.  Currently the internal implementation is
> to just open the postlist for the term and then call delete_document for
> each docid but there's scope for optimising this in the future.  But
> even with this optimised I expect it will be faster to delete by docid.
>   
That's what I thought. Cool.
> Hmm, if you're calling from PHP, make sure that you're passing a PHP
> integer to delete_document, not a PHP string.  If you pass a string
> you'll be calling the "delete_document by term" variant.  If you
> want to make sure, you can force a value to be an integer by using:
> intval($docid)
>   
Yep, that was the root of a long session of bug-hunting the other night.
Can't believe I didn't figure it out sooner though, given that I had
already made sure docids are passed as ints to the update function.
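
For anyone else hitting this, the difference boils down to (a sketch; $docid
and the "Q" ID term are placeholders):

    // An integer argument deletes the document with that docid:
    $db->delete_document(intval($docid));

    // A string argument is the "delete by term" variant - it deletes
    // every document indexed by that term (e.g. a unique "Q" ID term):
    $db->delete_document("Q" . $docid);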
>   
>> Is the answer to my question here to split the data into multiple databases?
>>     
>
> It might help, since you say this worked with smaller databases, but I'd
> try moving other work off the server first.  Your database isn't that
> enormous.
>   
Can't do that until I have set up a new machine, but the multiple-db
patch is already done and implemented, so I'll see how that pans out. The
index is rebuilding right now. Exciting :)
>   
>> Technically I know how to do it, but not logically. How many databases
>> should I aim for - i.e., should I aim for them not to be over a certain
>> size, contain a certain amount of documents, or something else?
>>     
>
> It depends on the spec of the server really.  When I'm building gmane's
> index from scratch, I build a number of databases of 1 million documents
> each and then merge them at the end, as I found that was fastest (though
> I didn't do extensive trials).
>   
I tried with 250k docs per sub-db this time. But I have made it so I can
adjust this limit without rebuilding the entire db, so it can change later.

Question: how do you merge the sub-dbs in the end, for the search
functions?

The machine I use is a dual P3 1200 MHz Xeon with 3.5 GB of RAM.
It is quite overworked serving a busy website, various stats-generating
scripts, and other misc scripts. Moving the Xapian stuff to a different
server is my next step.
>   
>> Should I distribute documents sequentially into them, or randomly, or
>> use some other scheme?
>>     
>
> If you ever want to search a particular subset, you could put that in
> its own database (or databases).  And if you need to remove old
> documents periodically (e.g. anything over a year old) doing it by
> date makes sense.  Otherwise it's probably rather arbitrary.
>   
Yeah documents could stay in the index for a day or for 5 years, so no
need for me to think about that then.

Thank you so much for your help Olly, very much appreciated.

Alec
> Cheers,
>     Olly
>
>   