[Xapian-discuss] performance issue

Sat Dec 23 03:20:52 GMT 2006

On Thu, Dec 21, 2006 at 05:48:20PM +0800, Andrey Kong wrote:
> I started to have some slow performance issues after my DB grew to 450,000
> docs; it takes 3 - 11 secs for a query now.
> I think there is something wrong in my structure...

First question, is this with quartz or flint?  I'd suggest using flint
rather than quartz.

Second question - does compacting the database help?  (Using quartzcompact
for quartz, or xapian-compact for flint).

> database total terms count:1,954,698
> num of Docs: 470,000 (approx 30-300 terms per doc)
> 
> postlist_DB file size: 1.3G
> position_DB file size: 2.8G
> record_DB file size: 4.6G
> termlist_DB file size: 1.1G
> 
> I wonder whether the 1,954,698 terms in my DB are normal, or too much garbage?

Sounds reasonable enough (Gmane has nearly 40 million unique terms!)

> The contents of the docs are basically web pages with the tags stripped, in
> Chinese (segmented).
> The query is a simple OP_OR, e.g. (google OR PTITLE:google OR yahoo OR
> PTITLE:yahoo OR msn OR PTITLE:msn)

Oh, I'd assumed it would be phrase queries.  This is odd.  Do you see a
query running much faster if you repeat it?
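
One quick way to check is to time the same query a few times in a row.
If the first run is much slower than the later ones, the time is probably
going on disk I/O which the later runs get from cache.  A minimal Python
sketch, where run_query is a hypothetical stand-in for whatever executes
your search (it isn't a Xapian call):

```python
import time


def time_repeated(run_query, repeats=3):
    """Call run_query() `repeats` times; return each wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: replace with the code that runs your search
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload so the sketch runs on its own.
    def fake_query():
        sum(range(100_000))

    for i, secs in enumerate(time_repeated(fake_query), 1):
        print(f"run {i}: {secs:.4f}s")
```

Run it against a query you haven't issued recently, otherwise the first
run may already be served from cache.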

What do you get reported with the "blocks read" patch?

http://www.oligarchy.co.uk/xapian/patches/flint-count-read-blocks.patch

> Dev. Server:
> INTEL p4 2.8 HT
> 2G ram
> 120G IDE  7200rpm 2M cache raid 1

What's the operating system?

Is it just the search running on this, or is there anything else using
it?  If so, what else?

> BTW, since processing the contents of each captured web page (the pre-index
> step) is very CPU demanding, do you have any suggestions for some sort of
> 'parallel computing / computer clustering / GRID' solution from your
> experience in building a search engine?

I usually favour a "decoupled" design - fetch and process your documents
and dump them in batches to files in a spool directory.  Then an
indexing process can sweep through the directory periodically.  The
spool directory can be NFS mounted or copied over with rsync or
some other mechanism if you want to run processes on different machines.

I like this because you can run several fetching processes on a single
spool directory (with sensible file naming conventions) and you can
start and stop fetchers and the indexer independently.
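
To make that a bit more concrete, here's a rough Python sketch of one way
to do the spool handling.  The function names, the ".batch" suffix and
the ".tmp-" prefix are all just assumptions for illustration, and
index_batch is a hypothetical callback standing in for the real indexer.
The one important detail is that a fetcher writes under a temporary name
and then renames, so the indexer never picks up a half-written file
(rename is atomic on a local POSIX filesystem; check the behaviour if
you're on NFS):

```python
import os
import tempfile


def spool_batch(spool_dir, name, data):
    """Write a batch under a temporary name, then rename it into place."""
    os.makedirs(spool_dir, exist_ok=True)
    # Temp file lives in the spool directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(spool_dir, name + ".batch")
    os.replace(tmp_path, final_path)  # publish the finished batch
    return final_path


def sweep(spool_dir, index_batch):
    """Pass each finished batch to index_batch() in name order, then delete it."""
    done = []
    for entry in sorted(os.listdir(spool_dir)):
        if not entry.endswith(".batch"):
            continue  # skip temp files that a fetcher is still writing
        path = os.path.join(spool_dir, entry)
        with open(path, "rb") as f:
            index_batch(f.read())  # hypothetical indexing callback
        os.remove(path)
        done.append(entry)
    return done
```

Each fetcher would call spool_batch() with a name that includes its own
identifier (say "fetcher1-0001") so that names never collide, and the
indexer would call sweep() from cron or a simple loop.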

If you're building something big enough, Google's massively parallel
use of map/reduce is worth considering.  From what I've read, they
use it for a lot of things, but then they have the advantage of having
already built the software and hardware infrastructure for it so I'm not
sure it's the best answer if you're building a system from scratch.

http://labs.google.com/papers/mapreduce.html

> What is the cost of searching more than one Xapian DB? Is it a good idea
> to break one DB down into 2 DBs if that option is available?

I'm not aware of any benchmarking of this, but it seems likely that
searching the same data split over 2 DBs will be slower, since the 2 DBs
together will tend to be bigger than a single DB (so more I/O) and because
there's a little more calculation required (so more CPU).

If you put the split databases on different servers and use the remote
backend, that could be faster (though splitting into more than 2 would
make more sense I think).

Also if you often need to search a subset, splitting the database can
make sense.

But you shouldn't really need to be thinking about splitting in this
case.

> eg.
> 
> DB [thread title + thread content]
> 
> VS
> 
> DB[thread title]
> DB[thread content]
> 
> (people normally search thread title + thread content, BUT there is an
> option to search ONLY the thread title)

I'm not sure I understand - are you treating the thread title and content
as separate documents?  How does that work?

Cheers,
    Olly