[Xapian-discuss] Xapian index size 475GB = 170 million documents (URLs)

Kevin Duraj kevin.softdev at gmail.com
Sat Dec 18 23:58:14 GMT 2010


Xapians,

I maintain two indexes for my search engines, each of approximately
the same size. I would like to share some figures with you, since many
of you have never seen a Xapian index of this size. You can also search
the indexes yourself at:

- http://myhealthcare.com/
- http://find1friend.com/

I need to add another 2 x 100 million documents to each index, and I
hope the result will fit on a single 2TB hard disk, and that I will
soon single-handedly beat the largest Xapian implementation,
BrightStation's Webtop search engine (archive.org snapshot), which
offered sub-second search over around 500 million web pages (around
1.5 terabytes of database files).
Reference: http://xapian.org/history

The size of one of the indexes:

total 475G
-rw-r--r-- 1 kevin kevin   28 2010-12-18 15:25 iamchert
-rw-r--r-- 1 kevin kevin   13 2010-12-18 12:19 position.baseA
-rw-r--r-- 1 kevin kevin 3.8M 2010-12-18 15:25 position.baseB
-rw-r--r-- 1 kevin kevin 240G 2010-12-18 15:25 position.DB
-rw-r--r-- 1 kevin kevin   13 2010-12-18 04:31 postlist.baseA
-rw-r--r-- 1 kevin kevin 923K 2010-12-18 11:36 postlist.baseB
-rw-r--r-- 1 kevin kevin  58G 2010-12-18 11:36 postlist.DB
-rw-r--r-- 1 kevin kevin   13 2010-12-18 11:36 record.baseA
-rw-r--r-- 1 kevin kevin 1.6M 2010-12-18 12:03 record.baseB
-rw-r--r-- 1 kevin kevin 102G 2010-12-18 12:02 record.DB
-rw-r--r-- 1 kevin kevin   13 2010-12-18 12:03 termlist.baseA
-rw-r--r-- 1 kevin kevin 1.2M 2010-12-18 12:19 termlist.baseB
-rw-r--r-- 1 kevin kevin  76G 2010-12-18 12:18 termlist.DB
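
To see where the space goes, the table sizes from the listing above can be
turned into shares of the total. This is only a rough sketch in Python: the
sizes are the rounded values printed by ls, the small .base files are
ignored, and the names TABLE_SIZES_GIB and table_shares are made up for
this illustration.

```python
# Sizes of the four chert .DB files, copied from the listing above.
# The values are the rounded "G" figures printed by ls, so the sum
# can differ slightly from the "total 475G" line.
TABLE_SIZES_GIB = {
    "position": 240,
    "postlist": 58,
    "record": 102,
    "termlist": 76,
}


def table_shares(sizes):
    """Return each table's share of the summed table sizes (0.0 to 1.0)."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    shares = table_shares(TABLE_SIZES_GIB)
    print(f"sum of .DB files: {sum(TABLE_SIZES_GIB.values())}G")
    # Largest table first.
    for name in sorted(shares, key=shares.get, reverse=True):
        print(f"{name:9s} {TABLE_SIZES_GIB[name]:4d}G  {shares[name]:6.1%}")
```

With these figures the position table accounts for roughly half of the
index, followed by the record, termlist and postlist tables.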

$ delve .
number of documents = 169346678
average document length = 230970
document length lower bound = 1
document length upper bound = 3585385
highest document id ever used = 169346678
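
A back-of-the-envelope check of whether a grown index might fit on the 2TB
disk can be sketched as below. It rests on several assumptions that are not
stated in the post: that ls reports binary units (1G = 2**30 bytes), that
the disk holds 2 * 10**12 bytes, that "2 x 100 million" means 200 million
more documents in one index, and that index size grows linearly with the
number of documents, which a real Xapian index need not do. The function
and constant names are invented for this example.

```python
# Figures copied from the listing and the delve output above.
INDEX_SIZE_GIB = 475          # "total 475G"
NUM_DOCS = 169_346_678        # "number of documents"

# Assumptions (not stated in the post):
GIB = 2**30                   # ls -h style binary units
DISK_BYTES = 2 * 10**12       # a "2TB" disk taken as decimal terabytes
EXTRA_DOCS = 200_000_000      # "2 x 100 million" read as 200 million more


def bytes_per_document(index_bytes, num_docs):
    """Average on-disk cost of one document across all tables."""
    return index_bytes / num_docs


def projected_size(index_bytes, num_docs, extra_docs):
    """Linear extrapolation: assumes the per-document cost stays constant."""
    return bytes_per_document(index_bytes, num_docs) * (num_docs + extra_docs)


if __name__ == "__main__":
    index_bytes = INDEX_SIZE_GIB * GIB
    per_doc = bytes_per_document(index_bytes, NUM_DOCS)
    projected = projected_size(index_bytes, NUM_DOCS, EXTRA_DOCS)
    print(f"bytes per document: {per_doc:.0f}")
    print(f"projected size:     {projected / 1e12:.2f} TB")
    print(f"fits on the disk:   {projected <= DISK_BYTES}")
```

Under these assumptions the index costs roughly 3 KB per document, and a
single index with 200 million more documents would come to a bit over
1.1 TB. Treat this as a rough estimate only, not a measurement.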

Kevin Duraj
http://pacificair.com/


