Amount of writes during index creation

Bron Gondwana brong at fastmail.fm
Sun Feb 3 09:22:30 GMT 2019


Indexing to tmpfs is nice because there is no disk I/O! So I would guess: as much as you have memory for at once. We have one index per user, and we have never had a user so big that we can't fit their index in memory, so initial index creation always builds the entire index in memory and then compacts it to the archive.
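
As a rough sketch of that flow (index in tmpfs, then compact down to disk once the tmpfs passes the 50% mark mentioned in the quoted message below), something like the following Python could drive it. The helper names and the paths in the usage note are hypothetical, not taken from our actual scripts; the only real tool assumed is the xapian-compact command, which takes one or more source databases followed by a destination.

```python
import shutil
from typing import List, Sequence


def tmpfs_fill_ratio(mount_point: str) -> float:
    """Return used/total for the filesystem that holds mount_point."""
    usage = shutil.disk_usage(mount_point)
    return usage.used / usage.total if usage.total else 0.0


def needs_compact(fill_ratio: float, threshold: float = 0.5) -> bool:
    """True once the tmpfs is more than `threshold` full (50% by default)."""
    return fill_ratio > threshold


def compact_command(sources: Sequence[str], destination: str) -> List[str]:
    """Build the argv for: xapian-compact SOURCE... DESTINATION."""
    if not sources:
        raise ValueError("need at least one source database")
    return ["xapian-compact", *sources, destination]
```

An hourly job could then call `needs_compact(tmpfs_fill_ratio("/path/to/tmpfs"))` and, when it returns true, hand the result of `compact_command(...)` to `subprocess.run(cmd, check=True)`. Keeping the command builder separate from the execution makes the policy easy to test without touching a real index.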

On Sun, Feb 3, 2019, at 10:07, Jean-Francois Dockes wrote:
> Bron Gondwana writes:
> > This is quite possibly part of the underlying write explosion that we ran into when we wrote:
> > 
> > https://fastmail.blog/2014/12/01/email-search-system/
> > 
> > Which, now almost 5 years on, has been running like a champion! We're really pleased with how well it works. Xapian reads from multiple databases are really easy, and the immediate writes onto tmpfs and daily compacts work really well. We also have a cron job which runs hourly and will do immediate compacts to disk from memory if the tmpfs hits more than 50% of its nominal size, and it means we almost never need to do any manual management as this thing indexes millions of new emails per day across our cluster.
> > 
> > And then when we do the compact down to disk, it's a single thread compacting indexes while new emails still index to tmpfs, so there's always tons of IO available for searches.
> > 
> > I think even with more efficient IO patterns, I'd still stick with the design we have. It's really nice :)
> > 
> > Bron.
> 
> 
> Thank you for this information.
> 
> I re-ran the 20 GB index creation with the latest xapian git code but a
> much smaller commit threshold (20 MB instead of 200 MB). More than
> 800 GB of data were written (instead of 125 GB).
> 
> So it would seem that the right approach for creating big indexes is to:
> 
> - Always set the commit interval as high as the available RAM allows.
> 
> - Use the future Xapian 1.4.10: the patch brings a significant improvement.
> 
> - Segment the index, then use xapian-compact to merge if needed. It would
>  be interesting to see how the fastmail approach works for an initial bulk
>  index creation, compared to just segmenting, that is, what is the optimal
>  number of merges?
> 
> JF
> 
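
To put the figures quoted above in perspective, here is a small Python sketch of the write amplification they imply. It assumes the "20 GB" refers to the size of the finished index, and it treats the 800 GB figure as a lower bound since the message says "more than"; the function name is made up for illustration.

```python
def write_amplification(written_gb: float, index_size_gb: float) -> float:
    """Bytes written to storage per byte of final index."""
    if index_size_gb <= 0:
        raise ValueError("index size must be positive")
    return written_gb / index_size_gb


INDEX_SIZE_GB = 20  # assumed to be the final index size

# (commit threshold, GB written) as reported in the quoted message
runs = (("200 MB", 125), ("20 MB", 800))

for threshold, written_gb in runs:
    ratio = write_amplification(written_gb, INDEX_SIZE_GB)
    print(f"commit threshold {threshold}: about {ratio:.2f}x the index size written")
```

Under those assumptions the 200 MB run wrote about 6.25 times the index size and the 20 MB run at least 40 times, which is why the commit threshold matters so much for bulk indexing.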

-- 
 Bron Gondwana
 brong at fastmail.fm


More information about the Xapian-discuss mailing list