[Xapian-discuss] Best Practices for Compaction?

Arjen van der Meijden acmmailing at tweakers.net
Sat Sep 19 16:17:46 BST 2009


It obviously depends on your source of data. Ours is a web forum where 
all topics and postings are stored in a database anyway, so not 
touching the data until the next cron interval occurs is 
perfectly OK for us.

If you get a stream of data which you need to catch and then process, 
you may indeed want to queue it up in some form of storage. We've had 
good experience with the ActiveMQ message-queueing system, but a normal 
database should work as well.

With ActiveMQ you could simply pause (or disconnect) your 
consumer for a while during the compaction, while the producers 
continue to offer data to ActiveMQ.
With a database you'd have to roll your own queueing or similar system, 
but that may be easier to work with, especially for batch processing and such.
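A minimal sketch of that pattern, using Python's queue.Queue as a 
stand-in for ActiveMQ (the names and the pause flag are illustrative, 
not ActiveMQ API): producers keep enqueueing while the consumer is 
paused for compaction, and the backlog is drained afterwards.

```python
# queue.Queue stands in for the ActiveMQ broker here: producers
# never block on compaction, and pausing the consumer loses nothing.
import queue
import threading

q = queue.Queue()
consuming = threading.Event()
consuming.set()  # consumer runs by default


def produce(doc):
    q.put(doc)  # producers keep offering data regardless of compaction


def drain():
    """Index everything queued so far; respects the pause flag."""
    indexed = []
    while consuming.is_set() and not q.empty():
        indexed.append(q.get())
    return indexed


# Normal operation: documents are consumed as they arrive.
produce("doc1")
assert drain() == ["doc1"]

# During compaction: pause the consumer; producers continue.
consuming.clear()
produce("doc2")
produce("doc3")
assert drain() == []  # paused, nothing consumed

# After compaction: resume and catch up on the backlog.
consuming.set()
assert drain() == ["doc2", "doc3"]
```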

Best regards,

Arjen

On 19-9-2009 16:45 Kenneth Loafman wrote:
> Richard Boulton wrote:
>> 2009/9/18 Arjen van der Meijden <acmmailing at tweakers.net
>> <mailto:acmmailing at tweakers.net>>
>>
>>     Hi Ken,
>>
>>     We're only updating the database in intervals, not continuously.
>>     What we're doing is basically:
>>     [symlink for the database is to compacted database]
>>     update "working" database
>>     change symlink for database to "working"
>>     compact working to a new compact database
>>     change symlink for database back to compact
>>
>>
>> I recommend using a stub-database file instead of a symlink - that way,
>> if a reader has opened some of the database files but not others when
>> the symlink changes, you don't get an inconsistent set of database files
>> being opened.
>>
>> There are a variety of swapping schemes like this: I've used several
>> different ones, depending on what requirements for update speed and
>> search speed I'm trying to satisfy.
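[The stub-database swap Richard describes above can be sketched as 
follows. A Xapian stub file is a one-line text file of the form 
"auto <path-to-database>"; the paths and helper name here are 
illustrative, not Xapian API calls.]

```python
# Sketch: repoint a Xapian stub-database file atomically, so a reader
# always sees either the old or the new target, never a half-written file.
import os
import tempfile


def point_stub_at(stub_path, db_path):
    """Atomically rewrite the stub file to point at db_path."""
    tmp = stub_path + ".new"
    with open(tmp, "w") as f:
        f.write("auto %s\n" % db_path)
    os.replace(tmp, stub_path)  # rename is atomic on POSIX


with tempfile.TemporaryDirectory() as d:
    stub = os.path.join(d, "db.stub")
    # Searches go to the "working" copy while it is being updated...
    point_stub_at(stub, os.path.join(d, "db-working"))
    # ...compact db-working into db-compact here...
    # ...then swap searches over to the compacted copy.
    point_stub_at(stub, os.path.join(d, "db-compact"))
```

Opening the stub path with Xapian then resolves to whichever database 
the file currently names, which is what makes the swap safe compared 
to a symlink.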
> 
> There's still the issue of a two-hour downtime, if I understand things
> correctly.  During the compaction the source is locked and the target is
> not usable, so collection has to stop, or be queued through another
> mechanism, correct?
> 
> I'm only updating the database in intervals, but I have to collect it
> within a one-hour interval of when it was produced, or it goes away.
> Two hours of downtime would mean at least one hour of lost activity
> unless I'm misunderstanding the whole link/swap process.
> 
> I'm thinking of using MySQL as a frontend so I can get 24/7 collection,
> but wanted to avoid the staging complexity if possible.  Having query
> access to the database unavailable for two hours is OK, but losing data
> collection is not.
> 
> ...Thanks,
> ...Ken
> 


