[Xapian-discuss] Xapian and 10M (small) documents. What to expect?
Bart van Bragt
xapian at vanbragt.com
Thu Sep 8 11:44:14 BST 2005
I've been thinking about integrating phpBB with Xapian for quite some
time now and I guess I really should start to get things rolling. I
haven't had a decent search on my site (www.bokt.nl) for ages now
and the users are getting pretty annoyed by that fact :)
I'm currently trying to figure out how to set this up; I probably
also need to get some new hardware first to facilitate search.
Does anyone have an idea what kind of hardware I would need to
search 10 million documents (approx 4GB of text) with approx 10,000 new
postings per day?
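To put those numbers in perspective, here is a rough back-of-envelope
calculation. It is only a sketch: it assumes "4GB" means 4 * 10^9 bytes
of raw text and that new postings are about the same size as existing
ones. The variable names are just for illustration, and the actual
index size on disk would be different from the raw text size.

```python
# Figures from the question above; everything else is derived from them.
TOTAL_DOCS = 10_000_000          # existing postings
TOTAL_BYTES = 4 * 10**9          # assumption: "4GB" = 4 * 10^9 bytes of raw text
NEW_PER_DAY = 10_000             # new postings per day

# Average raw text per posting.
avg_bytes = TOTAL_BYTES / TOTAL_DOCS

# Raw text added per day, assuming new postings match the average size.
daily_bytes = avg_bytes * NEW_PER_DAY

# Daily growth of the collection as a fraction of its current size.
daily_growth = NEW_PER_DAY / TOTAL_DOCS

print(f"average posting size: {avg_bytes:.0f} bytes")
print(f"new text per day:     {daily_bytes / 10**6:.0f} MB")
print(f"daily growth:         {daily_growth:.1%}")
```

So under these assumptions the documents are tiny (a few hundred bytes
each) and the nightly batch would only have to add a few MB of new text.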
Real-time indexing is nice but I can also batch this up so we can do
this during the night (servers are mostly idle during the night:
http://status.bokt.nl/ ).
Do I need a dedicated machine for searching? The site isn't exactly
generating huge amounts of money so it would be very nice if we could
use a (beefy) server to do both webserving and searching. Or we could
combine the database and the search, but I don't think those two combine
really well; I'm guessing that the main bottleneck is going to be I/O?
Does anyone have experience with integrating Xapian with (PHP) forums? I
know Arjan has plenty of experience with gathering.tweakers.net :D
Talking about which... I'd very much prefer to index individual postings
instead of combining all posts in a topic to one document. The main
reason for this is that combining large topics results in lots of hits
on those large topics because they contain a LOT of search terms. This
is my main gripe when searching on gathering.tweakers.net: you have to
wade through lots of 300 page topics that do contain your search words
but in quite separate postings on separate pages. Most of the time
those 300 page topics have no link at all with the subject that you're
searching for. IMO searching in postings instead of topics should solve
that problem. The main drawback is a (very significant?) performance
loss I guess... Indexing topics would result in only 500k documents
instead of 10M.
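A toy illustration of the difference, in plain Python rather than
Xapian. The topic names, the postings and the helper functions
(terms, search_topics, search_posts) are all made up for this sketch,
and it uses simple "all words must match" logic instead of real ranking:

```python
# Hypothetical toy data: topic id -> list of postings in that topic.
topics = {
    "big-chat-topic": [
        "my horse loves carrots",
        "new saddle arrived today",
        "rain again this weekend",
    ],
    "saddle-fitting": [
        "how to fit a saddle on a young horse",
    ],
}


def terms(text):
    """Split a text into a set of lowercase words."""
    return set(text.lower().split())


def search_topics(topics, query):
    """One document per topic: the terms of all its postings are merged."""
    hits = []
    for topic_id, posts in topics.items():
        merged = set().union(*(terms(p) for p in posts))
        if all(word in merged for word in query):
            hits.append(topic_id)
    return hits


def search_posts(topics, query):
    """One document per posting: all words must occur in the same posting."""
    hits = []
    for topic_id, posts in topics.items():
        for number, post in enumerate(posts):
            if all(word in terms(post) for word in query):
                hits.append((topic_id, number))
    return hits


query = ["horse", "saddle"]
topic_hits = search_topics(topics, query)
post_hits = search_posts(topics, query)
print("topic-level hits:", topic_hits)
print("post-level hits: ", post_hits)
```

With topic-level documents the big chat topic matches even though
"horse" and "saddle" appear in unrelated postings; with post-level
documents only the posting that really is about both words is returned.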
There seems to be a fairly large resemblance between gmane and phpBB
indexing (both are about indexing topics/threads and lots of small
postings). Is the gmane setup going to be public? Is it already
known what hardware this system is going to need?
Thanks in advance!
Bart van Bragt