[Xapian-discuss] high update-frequency strategy

Jan jan at griebsch.net
Thu Aug 13 08:18:40 BST 2009

Hi Everyone,

I'm evaluating Xapian for the following -hard- use-case:

1) document structure: avg. 100 KB full-text, 5 meta-data fields of
~100 bytes each, 3 boolean flags
2) big index, i.e. full-text volume ~ 1TB/disk (2x HD, mirrored)
3) low query-frequency (<1/sec)
4) 10 inserts/sec (on a 4core host)
5) *high update frequency of meta-data*, mostly on the boolean flags

Requirements 3 and 4 are no problem; inserts can be cached and mostly
steered towards bulk disk I/O when the load allows for it.
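
For scale, the figures in 1), 2) and 4) work out roughly as follows. This
is only a back-of-envelope sketch: it assumes decimal units (1 KB = 1000
bytes) and that the stated averages hold; the variable names are mine.

```python
# Back-of-envelope sizing from the figures in the list above.
# Assumption: decimal units, i.e. 1 KB = 1000 bytes and 1 TB = 10**12 bytes.
DOC_BYTES = 100 * 1000        # avg. full-text size per document
INDEX_BYTES = 10**12          # ~1 TB of full-text per disk
INSERTS_PER_SEC = 10

doc_count = INDEX_BYTES // DOC_BYTES                   # documents per disk
ingest_bytes_per_sec = INSERTS_PER_SEC * DOC_BYTES     # raw text arriving per second
days_to_fill = doc_count / INSERTS_PER_SEC / 86400     # time to reach ~1 TB

print(doc_count, ingest_bytes_per_sec, round(days_to_fill, 1))
```

So roughly ten million documents per disk and about 1 MB/s of incoming
text, which is why the inserts themselves are not the worry.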

The question is whether 5) can be achieved. It seems that an
	updateMyDoc(myDocId, meta-key, meta-value)

implementation invariably ends up running some variation of the
following against the (Flint) backend:

	docid = query(myDocId)
	doc = get_document(docid)
	// "updating" then maps to:
	* replace doc's meta-data in-memory
	* delete(mark-deleted ?) old doc in the index
	* re-insert the new doc

The last two ops work on the index cache. The bottleneck seems to be the
get_document operation which apparently causes (un-cached**) disk seeks.

**Our RAM-to-disk ratio is too small for the OS disk cache to be effective.

Is there any way to make get_document "lazier", i.e. not do lookups in
the persistent index, and to do the meta-data replace "dirty", i.e. simply
write the new value into the cache and not make it persistent until
flush() ?
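
To make the "dirty" idea concrete, here is a minimal sketch of what I
mean, written as an application-side buffer in plain Python. Nothing here
is Xapian API: the class PendingMetaUpdates, its methods, and the dict
standing in for the persistent index are all hypothetical.

```python
from collections import defaultdict


class PendingMetaUpdates:
    """Hypothetical application-side write-back buffer; not part of Xapian."""

    def __init__(self):
        # doc key -> {meta-key: latest meta-value}
        self._pending = defaultdict(dict)

    def update(self, doc_key, meta_key, meta_value):
        # No index access here; a later write to the same key
        # simply overwrites the earlier one in memory.
        self._pending[doc_key][meta_key] = meta_value

    def flush(self, store):
        # 'store' stands in for the persistent index (doc key -> meta dict).
        # Each touched document is read and rewritten once per flush,
        # no matter how many updates it received in between.
        touched = 0
        for doc_key in sorted(self._pending):
            meta = dict(store.get(doc_key, {}))    # the deferred read
            meta.update(self._pending[doc_key])
            store[doc_key] = meta                  # stands in for delete + re-insert
            touched += 1
        self._pending.clear()
        return touched


store = {"doc1": {"seen": "0", "starred": "0"}}
buf = PendingMetaUpdates()
buf.update("doc1", "seen", "1")
buf.update("doc1", "seen", "0")
buf.update("doc1", "starred", "1")
touched = buf.flush(store)
```

The obvious cost of doing this outside the index is that queries would
not see the new flag values until the buffer is flushed.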

What are the performance dis-/advantages of modeling meta-data as
prefix-terms vs. document values ?

Did I leave out any important constraints/facts ?
Otherwise: Any help, hints, experiences would be *greatly* appreciated.

