[Xapian-discuss] number of terms for a document

Richard Boulton richard at lemurconsulting.com
Mon Apr 23 14:42:17 BST 2007


Andreas Marienborg wrote:
> Hello again :)
> 
> Is there any sort of recommended limit on the number of (boolean) terms 
> one adds? For example, if I add, say, 10 000 different U<number> terms 
> to a document, would all searching be significantly slower?

Xapian should cope fine with that many terms per document. 
If all documents have that number of terms, searching would be slower 
than for a database in which documents have fewer terms, but that's 
simply because the whole database would be larger, and thus searches 
would tend to access more disk.  A search will typically only need to 
touch the posting lists for terms which appear in the query; other terms 
in the database will only affect performance due to how they affect the 
distribution of disk pages which need to be read.
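
To illustrate the point about posting lists, here is a minimal sketch in 
Python. It is a toy in-memory inverted index, not Xapian's implementation: 
the ToyIndex class, its lists_read counter and the U<number> terms are all 
made up for the example, and it ignores the disk layout entirely. It only 
shows that a boolean AND query reads the posting lists of its own terms, 
however many other terms the documents carry.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy inverted index: term -> list of document ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.lists_read = 0  # how many posting lists searches have read

    def add_document(self, docid, terms):
        # One posting-list entry per distinct term in the document.
        for term in set(terms):
            self.postings[term].append(docid)

    def search_all(self, query_terms):
        """Return ids of documents containing every query term (AND)."""
        result = None
        for term in query_terms:
            self.lists_read += 1
            docs = set(self.postings.get(term, ()))
            result = docs if result is None else result & docs
        return sorted(result or ())


index = ToyIndex()
# Three documents with 10 000 boolean-style terms each (assumed data).
for docid in range(1, 4):
    index.add_document(docid, [f"U{n}" for n in range(docid, docid + 10_000)])

hits = index.search_all(["U2", "U3"])
print(hits, "posting lists read:", index.lists_read)
```

In this model the cost of the query depends on the two posting lists it 
reads, not on the roughly 10 000 distinct terms in the index. What the toy 
leaves out is exactly the effect described above: in a real database the 
extra terms make the files larger and change which disk pages get read.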

You might find that _indexing_ is noticeably slower for a document with 
a very large number of terms - because, of course, each term has to be 
inserted into the appropriate place in the database.  In particular, if 
you're inserting lots of documents with very large numbers of terms 
you'll probably want to flush to disk a bit more frequently than if 
you're inserting smaller documents, because the buffer of changes in 
memory will get larger more quickly as you add documents.  But it should 
continue to go at a reasonable speed - I wouldn't expect slowdown to be 
significantly worse than linear with the number of terms in the 
document.  (Actual measured numbers for this would be fascinating, if 
anyone has them.)
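
One way to act on the flushing advice is to decide when to flush from the 
number of terms buffered rather than the number of documents. The sketch 
below is an assumption-laden illustration, not Xapian code: BufferedWriter, 
max_buffered_terms and the flush callback are hypothetical names, and the 
callback stands in for whatever the real writer does (with Xapian that 
would be a call to WritableDatabase::flush(), named commit() in later 
releases). The threshold value is arbitrary.

```python
class BufferedWriter:
    """Hypothetical writer that flushes once enough terms are buffered."""

    def __init__(self, flush, max_buffered_terms=100_000):
        self.flush = flush  # callback that receives the pending documents
        self.max_buffered_terms = max_buffered_terms
        self.buffered_terms = 0
        self.pending = []

    def add_document(self, terms):
        self.pending.append(terms)
        self.buffered_terms += len(terms)
        # Large documents reach the threshold after fewer additions.
        if self.buffered_terms >= self.max_buffered_terms:
            self.commit()

    def commit(self):
        # Hand the buffered documents to the callback and reset the buffer.
        if self.pending:
            self.flush(self.pending)
            self.pending = []
            self.buffered_terms = 0


flushed_batches = []  # records how many documents each flush contained
writer = BufferedWriter(
    lambda docs: flushed_batches.append(len(docs)),
    max_buffered_terms=25_000,
)

for _ in range(5):  # five large documents, 10 000 terms each
    writer.add_document([f"U{n}" for n in range(10_000)])
for _ in range(5):  # five small documents, 2 terms each
    writer.add_document(["U1", "U2"])
writer.commit()  # flush whatever is left at the end

print(flushed_batches)
```

With these made-up numbers the large documents force a flush after three 
additions, while the small ones simply ride along until the final commit. 
A fixed per-document threshold would treat both kinds the same and let the 
in-memory buffer grow much faster for the large ones.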

-- 
Richard


