[Xapian-discuss] xapian indexing size?
olly at survex.com
Thu May 5 19:25:55 BST 2005
On Thu, May 05, 2005 at 08:09:41PM +0200, rm at fabula.de wrote:
> On Thu, May 05, 2005 at 01:54:18PM -0400, John Paige wrote:
> > Yes, I was expecting that to be smaller than the corpus size.
I think an index smaller than the corpus is totally achievable. As you
say, other systems manage it.
Xapian needs to store document length (to implement BM25) and termlists
(to have an updateable index) -- some other systems use weighting schemes
which don't require document length and/or require you to create an
index all at once, so needn't store this information.
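
To illustrate why BM25 needs the document length, here is a minimal sketch
of the textbook Okapi BM25 term weight. It is not Xapian's code: the
function name, the parameter names and the default k1 and b values are
assumptions chosen for illustration only.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                     k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (illustrative).

    tf          -- occurrences of the term in this document
    doc_len     -- length of this document (the value the index must store)
    avg_doc_len -- average document length over the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Inverse document frequency, in the commonly used non-negative form.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Length normalisation: this is the only place doc_len is needed.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

With b set to 0 the length normalisation drops out, which is roughly why a
weighting scheme without it has no need to store document lengths.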
But for example, quartz stores the document length in every posting list
entry, which is rather wasteful. It does mean it is always handy, but
the efficiency gained from that is surely lost in the extra I/O it
requires. That's something I'm planning on changing...
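
A rough way to see the overhead is to count stored values under two
hypothetical layouts: one that repeats the document length in every posting
entry, and one that keeps it once per document in a separate table. This is
only a sketch. It does not reflect quartz's actual on-disk format or its
compression, and all names in it are made up for the example.

```python
from collections import Counter


def count_postings(docs):
    """Number of (term, document) pairs, i.e. posting list entries.

    docs maps a document id to the list of terms in that document.
    """
    return sum(len(Counter(terms)) for terms in docs.values())


def values_with_inline_length(docs):
    # Each posting entry holds (doc_id, wdf, doc_len): three values.
    return 3 * count_postings(docs)


def values_with_separate_length(docs):
    # Each posting entry holds (doc_id, wdf); doc_len is stored once per
    # document in its own table.
    return 2 * count_postings(docs) + len(docs)


docs = {
    1: ["index", "size", "index"],
    2: ["size", "corpus"],
}
inline = values_with_inline_length(docs)
separate = values_with_separate_length(docs)
```

The gap grows with the number of distinct terms per document, since the
inline layout pays for the length once per posting rather than once per
document.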
Updateability also requires a more complex database structure, though
I'm not convinced that inevitably means much more space overhead.