[Xapian-discuss] Document::set_data() Limitations?

Richard Boulton richard at lemurconsulting.com
Mon Jun 25 08:42:48 BST 2007


David wrote:
> I'm wondering if there are any limitations (hard or soft) on what you can shove
> into Document's set_data?
> 
> Can I put in binary data? Or is it really just meant for text? Is there a
> practical limit to how much information we can put in there?

Binary data is fine, but the system will attempt to compress data put 
into a document; this will still work correctly if the data is already 
compressed, but in that case it might be worth turning off the 
compression to avoid wasting CPU.
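
To illustrate why compressing already-compressed data wastes CPU, here 
is a minimal sketch. It uses Python's zlib module purely as a stand-in 
(which compressor Xapian uses, and how to turn it off, is not stated 
here), and random bytes stand in for data that is already compressed.

```python
import os
import zlib

# Repetitive text compresses well.
text = b"the quick brown fox jumps over the lazy dog\n" * 20000
# Assumption: random bytes behave like already-compressed data.
noise = os.urandom(len(text))

for label, blob in (("text", text), ("incompressible", noise)):
    packed = zlib.compress(blob)
    # The round trip is exact either way; only the size gain differs.
    assert zlib.decompress(packed) == blob
    print(label, len(blob), "->", len(packed))
```

The second line of output should show essentially no size reduction, so 
the time spent compressing that input buys nothing.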

> I suspect that I'll be putting in quite a lot, as in a couple to maybe a hundred
> MB. Is this silly?

Quite possibly, though it may work.

There is an upper limit imposed on the maximum length of data which can 
be stored in the document data, but it's not simple to give a value for 
it.  Currently, I believe the limit is:

((block_size - 19) / 4 - key_length - 7) * 65536

Where block_size is the length of a block in the database (which is 8192 
by default) and key_length is the length of the key being used in the 
table to look up the document data; this will usually be around 4 bytes.

This comes out to about 127 MB, so storing 2 MB is fine, but 200 MB will 
be a problem.
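
To make the arithmetic concrete, here is a small sketch that evaluates 
the formula. It assumes integer division, an 8192-byte block and a 
4-byte key; the function name is made up for illustration and is not 
part of Xapian.

```python
def max_document_data_bytes(block_size=8192, key_length=4):
    """Evaluate the limit formula from the message (integer division assumed)."""
    return ((block_size - 19) // 4 - key_length - 7) * 65536


limit = max_document_data_bytes()
print(limit, "bytes, about", round(limit / 2**20), "MiB")

# Check a few candidate payload sizes against the limit.
for size_mb in (2, 100, 200):
    fits = size_mb * 2**20 <= limit
    print(size_mb, "MB:", "fits" if fits else "exceeds the limit")
```

Changing block_size or key_length shows how sensitive the limit is to 
the database settings.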

It's probably a mistake to try storing that much data, anyway; while it 
should work, you'll end up with a single very large file in the Xapian 
database directory holding the records, which might be a pain when 
taking backups, etc.  Also, Xapian doesn't provide you with any ability 
to perform random access on the document data - you have to read it 
all into memory to access it: if the data were stored in a file, the 
operating system could access it much more efficiently.

Without knowing details of what you're trying to do, I'd probably 
recommend that you store the data for each document in a separate file, 
and store a pointer to the file in the document data.
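
A rough sketch of that approach follows. The helper names, the file 
naming scheme and the FakeDocument class are all hypothetical; the only 
thing assumed about the real Document class is that it offers 
set_data() and get_data(), so a real Xapian document could be passed in 
place of the stand-in.

```python
import hashlib
import os
import tempfile


def store_payload(doc, payload, directory):
    """Write payload to its own file and store only the path in doc."""
    # Hypothetical naming scheme: content hash plus a fixed extension.
    name = hashlib.sha1(payload).hexdigest() + ".bin"
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    doc.set_data(path)
    return path


def read_slice(doc, offset, length):
    """Read part of the payload without loading the whole file."""
    data = doc.get_data()
    # Accept either bytes or str, whichever get_data() returns.
    path = data.decode() if isinstance(data, bytes) else data
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


class FakeDocument:
    """Stand-in exposing set_data()/get_data() so the sketch runs alone."""

    def set_data(self, data):
        self._data = data

    def get_data(self):
        return self._data


doc = FakeDocument()
workdir = tempfile.mkdtemp()
store_payload(doc, b"0123456789" * 1000, workdir)
print(read_slice(doc, 10, 5))
```

The document data then stays tiny regardless of payload size, and 
seeking within the file gives the random access that the document data 
itself does not.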

> I'm still in the investigation stage, and would just like to know where my
> limits are so I can design this properly.

-- 
Richard


