[Xapian-discuss] storing documents in Xapian vs. external store (when other indexes are needed)

Olly Betts olly at survex.com
Tue Jan 20 00:23:44 GMT 2009

On Mon, Jan 19, 2009 at 01:46:58PM +0100, Marinos Yannikos wrote:
> for a set of documents that are indexed with Xapian for fast search and
> also with external (hash/B-Tree etc., like tokyocabinet) indexes for fast
> access by value, is it a good idea to store the whole document in Xapian's
> DB and fetch it by Xapian's doc_id after searching in the external index,
> or the other way round, i.e. store the document somewhere else and use
> some external oid as the Xapian "document"?

My usual advice is to store the document externally if you need to
access it externally (e.g. if you have an existing SQL-based system
it's likely easier not to have to change it to pull the data out of
Xapian instead).  Otherwise you might as well put it in Xapian.
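
A minimal sketch of the "store externally" layout, for illustration
only.  The SQLite schema, the add() and search() helpers and the dict
standing in for the search index are all hypothetical; with Xapian the
external oid would be kept as the document data of each indexed
document rather than in a Python dict.

```python
import sqlite3

# External store: stand-in for an existing SQL-based system
# (hypothetical schema, not taken from the thread).
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE docs (oid TEXT PRIMARY KEY, body TEXT)")

# Stand-in for the search index: term -> set of oids.  This is NOT
# Xapian; it only models "the index holds the oid, not the document".
index = {}


def add(oid, body):
    """Store the full document externally and index only its oid."""
    store.execute("INSERT INTO docs VALUES (?, ?)", (oid, body))
    for term in set(body.lower().split()):
        index.setdefault(term, set()).add(oid)


def search(term):
    """Return (oid, body) pairs for documents containing term."""
    results = []
    for oid in sorted(index.get(term.lower(), ())):
        row = store.execute(
            "SELECT body FROM docs WHERE oid = ?", (oid,)
        ).fetchone()
        results.append((oid, row[0]))
    return results


add("a1", "Flint stores document data in a table")
add("b2", "Chert avoids rewriting unchanged data")
```

Each search hit costs one extra lookup in the external store, which is
the price of leaving the existing system untouched.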

> In other words/short version: is Xapian/Flint good for storing documents
> even if they are often fetched by doc_id?

Yes, the document data is stored in a Btree keyed by the document id.

> - possibly slower retrieval by some other indexed value if fetching from
> Flint by doc_id is slower than the external storage solution (tokyocabinet
> etc.)

I've not compared, but I'd expect it to be competitive.  If you
benchmark I'd be interested to see results.
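
A rough harness for such a comparison might look like the sketch
below.  The time_fetches() helper and the in-memory SQLite table are
assumptions made for illustration; SQLite only stands in for a real
backend so that the script runs on its own, and the numbers it prints
say nothing about Flint or tokyocabinet.

```python
import random
import sqlite3
import time


def time_fetches(fetch, keys, repeat=3):
    """Return the best observed mean seconds per call of fetch(key)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for key in keys:
            fetch(key)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(keys))
    return best


# Stand-in backend: 1000 documents of 1 KiB each, keyed by integer id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    ((i, b"x" * 1024) for i in range(1, 1001)),
)


def fetch_sqlite(docid):
    """Fetch one document body by id from the stand-in store."""
    row = conn.execute(
        "SELECT body FROM docs WHERE id = ?", (docid,)
    ).fetchone()
    return row[0]


# Fetch in random order with a fixed seed so runs are repeatable.
rng = random.Random(0)
keys = [rng.randint(1, 1000) for _ in range(5000)]

per_fetch = time_fetches(fetch_sqlite, keys)
print(f"sqlite stand-in: {per_fetch * 1e6:.1f} us per fetch")
```

To time Xapian itself, pass a callable such as
lambda docid: db.get_document(docid).get_data() for an opened database
db, and run against an on-disk database with a realistic number of
documents and a cold cache as well as a warm one.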

> - bigger DB, perhaps slower access

The document data lives in a separate table, so it shouldn't make a
difference to matching.  The OS will have more Xapian data to consider
caching, but that's probably equivalent to the data from the external
store it would have to consider caching if you used one (at least if
that's on the same machine).

> - document changes are probably slower even if the indexed text is not
> changed

This is an issue for flint.  Chert is already better at not rewriting
unchanged data in this case.  There's scope for further work - see this


> Any opinions/suggestions? Am I on the wrong track for storing documents
> with several indexed values + fast text search? (I know that the problem
> fits an RDBMS well, but Xapian is so much faster)

I think it's a sane option to consider for many uses.  But if you need
the relational aspects of an RDBMS or advanced SQL queries, you probably
aren't going to be satisfied with using Xapian alone.
