Amount of writes during index creation

Sun Feb 3 11:04:06 GMT 2019

Bron Gondwana writes:
 > Indexing to tmpfs is nice because there is no disk IO! So I would guess as
 > much as you have memory for at once. We have one index per user, and we have
 > never had a user so big that we can't fit their index in memory, so initial
 > index creation always builds the entire index in memory, then compacts it to
 > archive.

The user who presented the issue seems to be creating a huge index (one of
the tests was stopped with an index size of almost 250 GB). Depending on
local conditions, using tmpfs may force a large number of merges. I don't
know how efficient this would be compared to creating several dbs of
similar size on disk and then either merging them or querying them in
parallel. Experimentation needed...
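
As a rough illustration of the "many merges" point, here is a
back-of-the-envelope estimate of how many tmpfs-sized segments a bulk run
would produce. It is only a sketch: the function name, the 64 GB tmpfs size
and the expansion factor are assumptions of mine, and the 50% fill level is
borrowed from the hourly compaction job described in the quoted message
below.

```python
import math

def segments_needed(final_index_gb, tmpfs_gb, compact_fraction=0.5, expansion=1.0):
    """Estimate how many segments get compacted from tmpfs to disk.

    final_index_gb:   expected size of the finished index
    tmpfs_gb:         nominal tmpfs size (hypothetical, depends on RAM)
    compact_fraction: fill level at which a segment is compacted to disk
    expansion:        assumed ratio of uncompacted to compacted size
    """
    usable_gb = tmpfs_gb * compact_fraction / expansion
    return math.ceil(final_index_gb / usable_gb)

# 250 GB index, 64 GB tmpfs (assumed), compaction at 50% fill
print(segments_needed(250, 64))
# Same index with a 16 GB tmpfs
print(segments_needed(250, 16))
```

Every one of these segments then has to be merged or queried alongside the
others, so the count grows quickly as the tmpfs shrinks relative to the
index.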

Also, depending on how the source data is organized, it may not be simple
to segment it into small enough pieces, and Recoll has nothing to help with
this.
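
When the source data does split along directory boundaries, the
segmentation could be scripted outside of Recoll. A minimal sketch, assuming
the per-directory sizes are already known: partition_by_size and
compact_command are hypothetical helper names, and the only thing taken from
Xapian is the xapian-compact argument order (source databases first,
destination last).

```python
def partition_by_size(sizes, budget):
    """Group names into batches whose total size stays within budget.

    sizes maps a name (for example a top-level directory) to its size.
    Uses first-fit decreasing; an item larger than budget gets its own batch.
    """
    batches = []  # each entry is [total_size, [names]]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        for batch in batches:
            if batch[0] + size <= budget:
                batch[0] += size
                batch[1].append(name)
                break
        else:
            batches.append([size, [name]])
    return [names for _, names in batches]

def compact_command(segment_dbs, destination):
    """Build the xapian-compact command line that merges the segment databases."""
    return ["xapian-compact", *segment_dbs, destination]

# Hypothetical directory sizes in GB, with a 100 GB budget per segment
sizes = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 10}
print(partition_by_size(sizes, 100))
print(compact_command(["seg0", "seg1"], "merged"))
```

Each batch would then be indexed into its own database, and the resulting
databases either merged with the generated command or queried together.
This does not help when a single directory is itself too big for one
segment.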

 > On Sun, Feb 3, 2019, at 10:07, Jean-Francois Dockes wrote:
 > 
 >     Bron Gondwana writes:
 >     > This is quite possibly part of the underlying write explosion that we
 >     > ran into when we wrote:
 >     > 
 >     > https://fastmail.blog/2014/12/01/email-search-system/
 >     > 
 >     > Which now almost 5 years on, has been running like a champion! We're
 >     > really pleased with how well it works. Xapian reads from multiple
 >     > databases are really easy, and the immediate writes onto tmpfs and daily
 >     > compacts work really well. We also have a cron job which runs hourly and
 >     > will do immediate compacts to disk from memory if the tmpfs hits more than
 >     > 50% of its nominal size, and it keeps us from almost ever needing to do
 >     > any manual management as this thing indexed millions of new emails per day
 >     > across our cluster.
 >     > 
 >     > And then when we do the compact down to disk, it's a single thread
 >     > compacting indexes while new emails still index to tmpfs, so there's
 >     > always tons of IO available for searches.
 >     > 
 >     > I think even with more efficient IO patterns, I'd still stick with the
 >     > design we have. It's really nice :)
 >     > 
 >     > Bron.
 > 
 >     Thank you for this information.
 >    
 >     I re-ran the 20 GB index creation with the latest Xapian git code but a
 >     much smaller commit threshold (20 MB instead of 200 MB). More than
 >     800 GB of data were written (instead of 125 GB).
 >    
 >     So it would seem that the right approach for creating big indexes is to:
 >    
 >     - Always set the commit interval as high as the available RAM allows.
 >    
 >     - Use the upcoming Xapian 1.4.10: the patch it includes brings a
 >       significant improvement.
 >    
 >     - Segment the index, then use xapian-compact to merge if needed. It would
 >       be interesting to see how the fastmail approach works for an initial
 >       bulk index creation, compared to just segmenting, that is, what is the
 >       optimal number of merges?
 >    
 >     JF
 > 
 > -- 
 >   Bron Gondwana
 >   brong at fastmail.fm
 >