[Xapian-discuss] indexing strategy for "near real time" indexing

Olly Betts olly at survex.com
Tue Jun 19 19:01:41 BST 2007


On Thu, Jun 14, 2007 at 05:33:19PM -0400, Jarrod Roberson wrote:
> I am working on a proof of concept real time email indexer using
> xapian.  This is for HUGE volumes, think ISP level.  I have to come up
> with a strategy for indexing the messages as they come in as near real
> time as I can.
>
> I am considering indexing into many databases based on time and / or
> size, and then trying to xapian-compact them together at the end of
> the day, and start over. The single writer limitation is what I am
> trying to address.

My thoughts would be to dump a copy of each message to be indexed into a
spool directory (or directory hierarchy), and have the indexer process
run through the spool.  Either one message per file, or perhaps better
in batches.

That way a sudden surge of email doesn't overwhelm the system - it just
creates a temporary backlog of unindexed mail.  And the indexer can be
temporarily taken off-line without having to halt mail delivery or miss
indexing messages.
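
As a rough illustration of the spool idea, here is a minimal sketch in
Python.  The tmp/ and new/ layout, the function names, and the index_one
callback are all assumptions made for the example; index_one stands in
for whatever actually adds a message to the Xapian database.

```python
import os
import tempfile
from pathlib import Path


def spool_message(spool_dir, raw_message):
    """Write one message (bytes) into the spool atomically; return its path."""
    spool = Path(spool_dir)
    (spool / "tmp").mkdir(parents=True, exist_ok=True)
    (spool / "new").mkdir(parents=True, exist_ok=True)
    # Write into tmp/ first so the indexer never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=str(spool / "tmp"))
    with os.fdopen(fd, "wb") as f:
        f.write(raw_message)
        f.flush()
        os.fsync(f.fileno())
    final_path = spool / "new" / os.path.basename(tmp_path)
    # rename() is atomic when tmp/ and new/ are on the same filesystem.
    os.rename(tmp_path, str(final_path))
    return final_path


def drain_spool(spool_dir, index_one, batch_size=1000):
    """Index up to batch_size spooled messages, oldest first; return the count."""
    new_dir = Path(spool_dir) / "new"
    if not new_dir.is_dir():
        return 0
    pending = sorted(new_dir.iterdir(), key=lambda p: (p.stat().st_mtime, p.name))
    done = 0
    for path in pending[:batch_size]:
        index_one(path.read_bytes())  # hypothetical callback supplied by the caller
        path.unlink()  # remove only after indexing succeeded
        done += 1
    return done
```

If the indexer is stopped, spool_message() keeps working and the files
simply pile up in new/ until drain_spool() runs again.  Calling
drain_spool() in a loop and committing the database once per batch is one
way to get the batching mentioned above.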

You need to be able to index faster than messages arrive on average,
and ideally fast enough to keep up with all but the peaks of demand - if
necessary, you can run multiple indexers with a spool each and add new
messages to each in a round-robin way.  You can combine databases with
xapian-compact when it's quieter as you suggest.
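
The round-robin distribution could look something like the sketch below.
The class and function names are made up for the example, and it assumes
one spool directory and one database per indexer.  compact_command() only
builds the argument list, on the assumption that xapian-compact takes the
source databases followed by the destination.

```python
import itertools


class RoundRobinSpooler:
    """Hand out spool directories in turn, one per incoming message."""

    def __init__(self, spool_dirs):
        self._dirs = list(spool_dirs)
        if not self._dirs:
            raise ValueError("need at least one spool directory")
        self._cycle = itertools.cycle(self._dirs)

    def choose(self):
        """Return the spool directory the next message should go to."""
        return next(self._cycle)


def compact_command(source_dbs, dest_db):
    """Build the argv for merging the per-indexer databases into one."""
    # Assumed usage: xapian-compact SOURCE_DATABASE... DESTINATION_DATABASE
    return ["xapian-compact", *source_dbs, dest_db]
```

The delivery side would call choose() for each message and write it into
the returned spool; the list from compact_command() can be passed to
subprocess.run() during a quiet period.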

Cheers,
    Olly


