[Xapian-discuss] indexing strategy for "near real time" indexing

Sam Liddicott sam at liddicott.com
Tue Jun 19 21:33:09 BST 2007


Are you indexing a mail store, referring back to the store to retrieve the original message, or indexing a mail spool as messages pass through? How will messages expire? Which processes will have read and write access to the store/spool?

If a spool, I suggest you modify the SMTP daemon to create hard links (in a different directory) to the queued message, either when it enters the queue, when it finally leaves, or when it is delivered successfully to one (or each) recipient, depending on which strategy suits best.
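A rough sketch of the hard-link idea, in Python for brevity. The function name and directory layout are made up for illustration, and it assumes the index spool sits on the same filesystem as the mail queue (hard links cannot cross filesystems):

```python
import os

def link_into_index_spool(queued_path, spool_dir):
    """Hard-link a queued message into spool_dir and return the link path.

    Hypothetical helper: a real SMTP daemon would call something like
    this from whichever hook (enqueue, dequeue, delivery) suits best.
    """
    os.makedirs(spool_dir, exist_ok=True)
    target = os.path.join(spool_dir, os.path.basename(queued_path))
    try:
        # A hard link shares the file's data, so nothing is copied and
        # the indexer's copy survives the daemon deleting the original.
        os.link(queued_path, target)
    except FileExistsError:
        pass  # already linked on an earlier pass
    return target
```

The indexer then simply deletes its link once the message is indexed.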

If a store, then you probably want to track changes to the store. Maildir and mdir are simple, but mbox may require scanning whole mailboxes to look for added or removed message IDs.
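For the Maildir case, change tracking can be as simple as comparing directory listings between passes. This is only a sketch under the usual Maildir conventions (messages in new/ and cur/, flags appended after a colon in the file name); the function names are mine:

```python
import os

def maildir_snapshot(maildir):
    """Return the set of message names currently in a Maildir."""
    names = set()
    for sub in ("new", "cur"):
        subdir = os.path.join(maildir, sub)
        if not os.path.isdir(subdir):
            continue
        for name in os.listdir(subdir):
            # Drop the ":2,FLAGS" suffix so that a flag change or a move
            # from new/ to cur/ is not mistaken for a new message.
            names.add(name.split(":", 1)[0])
    return names

def diff_snapshots(old, new):
    """Return (added, removed) message names between two snapshots."""
    return new - old, old - new
```

Added names go to the indexer; removed names are candidates for deletion from the index.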

As Olly points out, it's best to use a queue. You can't really do real time unless you have enough CPU to cope with unforeseen peaks, or unless you throttle reception by tying it to the index process.
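To illustrate what I mean by tying reception to the index process: with a bounded queue, the receiving side blocks once the backlog is full, so it can never outrun the indexer. A toy sketch, where index_one is a placeholder for the real indexing call and max_backlog is an arbitrary number:

```python
import queue
import threading

def run_pipeline(messages, index_one, max_backlog=100):
    """Feed messages to a single indexer thread through a bounded queue."""
    backlog = queue.Queue(maxsize=max_backlog)

    def indexer():
        while True:
            item = backlog.get()
            if item is None:  # sentinel: no more messages
                break
            index_one(item)

    worker = threading.Thread(target=indexer)
    worker.start()
    for message in messages:
        # put() blocks while the queue is full, which is the throttle:
        # reception slows to the pace the indexer can sustain.
        backlog.put(message)
    backlog.put(None)
    worker.join()
```

An on-disk spool gives you the queue without the throttle, which is usually the better trade for mail.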

Sam


-----Original Message-----
From: "Olly Betts" <olly at survex.com>
To: "Jarrod Roberson" <jarrod at vertigrated.com>
Cc: xapian-discuss at lists.xapian.org
Sent: 19/06/07 19:01
Subject: Re: [Xapian-discuss] indexing strategy for "near real time" indexing

On Thu, Jun 14, 2007 at 05:33:19PM -0400, Jarrod Roberson wrote:
> I am working on a proof of concept real time email indexer using
> xapian.  This is for HUGE volumes, think ISP level.  I have to come up
> with a strategy for indexing the messages as they come in as near real
> time as I can.
>
> I am considering indexing into many databases based on time and / or
> size, and then trying to xapian-compact them together at the end of
> the day, and start over. The single writer limitation is what I am
> trying to address.

My thoughts would be to dump a copy of each message to be indexed into a
spool directory (or directory hierarchy), and have the indexer process
run through the spool.  Either one message per file, or perhaps better
in batches.

That way a sudden surge of email doesn't overwhelm the system - it just
creates a temporary backlog of unindexed mail.  And the indexer can be
temporarily taken off-line without having to halt mail delivery or miss
indexing messages.
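A minimal sketch of such an indexer loop, assuming one message per file in a flat spool directory. The index_batch callback is a placeholder for the code that would actually add the documents to the Xapian database and commit; the names and the batch size are illustrative only:

```python
import os

def take_batch(spool_dir, batch_size=100):
    """Return up to batch_size message paths, oldest name first."""
    names = sorted(os.listdir(spool_dir))[:batch_size]
    return [os.path.join(spool_dir, name) for name in names]

def drain_spool(spool_dir, index_batch, batch_size=100):
    """Index everything currently in the spool, one batch at a time."""
    total = 0
    while True:
        batch = take_batch(spool_dir, batch_size)
        if not batch:
            return total
        index_batch(batch)
        # Remove files only after the batch has been indexed, so a
        # crash leaves the messages in the spool to be retried.
        for path in batch:
            os.remove(path)
        total += len(batch)
```

Running drain_spool from a loop with a short sleep gives the backlog behaviour described above: a surge only makes the spool longer for a while.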

You need to be able to index faster than messages arrive on average,
and ideally fast enough to keep up with all but the peaks of demand - if
necessary, you can run multiple indexers with a spool each and add new
messages to each in a round-robin way.  You can combine databases with
xapian-compact when it's quieter as you suggest.
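As a sketch of the round-robin part, with made-up names and directory layout: each incoming message is copied into the next spool in turn, and a small helper builds the xapian-compact command line (source databases first, destination last) for the quiet-time merge:

```python
import itertools
import os
import shutil

class RoundRobinSpooler:
    """Distribute incoming messages across several spool directories."""

    def __init__(self, spool_dirs):
        self._cycle = itertools.cycle(list(spool_dirs))

    def deliver(self, message_path):
        """Copy one message into the next spool and return its new path."""
        dest_dir = next(self._cycle)
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, os.path.basename(message_path))
        shutil.copyfile(message_path, dest)
        return dest

def compact_command(source_dbs, dest_db):
    """Build the argument list for merging the per-indexer databases."""
    return ["xapian-compact", *source_dbs, dest_db]
```

The command list can be handed to subprocess.run once the indexers have caught up.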

Cheers,
    Olly

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss at lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss



