[Xapian-discuss] number of terms for a document
Richard Boulton
richard at lemurconsulting.com
Mon Apr 23 14:42:17 BST 2007
Andreas Marienborg wrote:
> Hello again :)
>
> Is there any sort of recommended limit to the number of (boolean) terms
> one adds? For example, if I add, say, 10 000 different U<number> terms
> to a document, would all searching be significantly slower?

Xapian should be able to cope fine with that many terms per document.
If all documents have that many terms, searching would be slower than
for a database in which documents have fewer terms, but that's simply
because the whole database would be larger, and thus searches would
tend to read more from disk. A search will typically only need to
touch the posting lists for terms which appear in the query; other terms
in the database will only affect performance through how they affect the
distribution of disk pages which need to be read.
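
To illustrate why only the query's terms matter, here is a minimal
sketch of a toy in-memory inverted index in Python. It is not Xapian's
actual on-disk structure; the ToyIndex class, its lists_read counter and
the search_and method are all hypothetical names made up for this
example.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index (an illustration only, not how Xapian stores data)."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of document ids
        self.lists_read = 0                # how many posting lists searches touched

    def add_document(self, docid, terms):
        # Each distinct term gets this document appended to its posting list.
        for term in set(terms):
            self.postings[term].append(docid)

    def search_and(self, query_terms):
        # Boolean AND: only the posting lists named in the query are read.
        result = None
        for term in query_terms:
            self.lists_read += 1
            docs = set(self.postings.get(term, ()))
            result = docs if result is None else result & docs
        return sorted(result or ())


index = ToyIndex()
# 20 documents, each carrying 10 000 boolean-style terms U<number>.
for docid in range(1, 21):
    index.add_document(docid, [f"U{n}" for n in range(docid, docid + 10000)])

matches = index.search_and(["U5", "U12"])
print(matches, index.lists_read, len(index.postings))
```

The index holds over ten thousand distinct terms, yet the two-term
query reads only two posting lists. In a real database the remaining
terms still cost something indirectly, through the disk-page layout
described above, which this toy does not model.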

You might find that _indexing_ is noticeably slower for a document with
a very large number of terms - because, of course, each term has to be
inserted into the appropriate place in the database. In particular, if
you're inserting lots of documents with very large numbers of terms
you'll probably want to flush to disk a bit more frequently than if
you're inserting smaller documents, because the buffer of changes in
memory will get larger more quickly as you add documents. But indexing
should continue at a reasonable speed - I wouldn't expect the slowdown to
be significantly worse than linear in the number of terms in the
document. (Actual measured numbers for this would be fascinating, if
anyone has them.)
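
One way to act on the flushing advice is to trigger a flush based on the
number of terms buffered rather than the number of documents added. The
following Python sketch assumes a caller-supplied commit callback; the
TermBudgetFlusher class and the 100 000-term threshold are hypothetical
and chosen only for illustration. With Xapian itself the callback would
presumably wrap the writable database's flush() method (named commit()
in later releases).

```python
class TermBudgetFlusher:
    """Calls commit() once enough terms have accumulated (hypothetical helper)."""

    def __init__(self, commit, max_buffered_terms=100_000):
        self.commit = commit                      # caller-supplied flush function
        self.max_buffered_terms = max_buffered_terms
        self.buffered_terms = 0                   # terms added since last commit
        self.commits = 0                          # number of commits performed

    def added(self, term_count):
        # Call after adding a document that carries term_count terms.
        self.buffered_terms += term_count
        if self.buffered_terms >= self.max_buffered_terms:
            self.flush()

    def flush(self):
        # Commit only if something is actually buffered.
        if self.buffered_terms:
            self.commit()
            self.commits += 1
            self.buffered_terms = 0


# 1000 small documents (100 terms each) versus 1000 large ones (10 000 terms).
small = TermBudgetFlusher(commit=lambda: None)
for _ in range(1000):
    small.added(100)
small.flush()

large = TermBudgetFlusher(commit=lambda: None)
for _ in range(1000):
    large.added(10_000)
large.flush()

print(small.commits, large.commits)
```

With the same term budget, the large documents trigger a commit every
ten documents while the small ones need only one commit in total, which
keeps the in-memory buffer at a similar size in both cases.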
--
Richard