[Xapian-discuss] Re: Re: get_docid over multi-database search

Kevin Duraj kevin.softdev at gmail.com
Wed Dec 19 22:32:52 GMT 2007


On Dec 18, 2007 3:49 AM, Olly Betts <olly at survex.com> wrote:
> On Fri, Dec 14, 2007 at 11:18:12AM -0800, Andrey wrote:
> > from my own experience, breaking up into dbs will not cause a big
> > performance loss, like from 1sec to 2 secs, it just works like querying 1 db
> > once it is cached

We are all missing the point here. There are two types of Xapian users:

1. Search engines with fewer than 1 million documents, or whose data
fits in memory.
2. Search engines with 1-100 million documents, whose data is much
larger than memory.

People who test performance on data that easily fits into server
memory have their data cached in memory, so their measurements look
fast but are distorted. We must measure performance when the data is
not cached in memory but sitting on the hard disk. Only then do we see
the real performance of a search, as the hard disk spins and seeks to
the correct data. Then the OS (Linux) places what it read into its
cache, if memory is available. A second identical search will use the
cache instead of the hard disk, so its timing is misleadingly fast and
not a valid measurement.
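
A minimal sketch of how the first run and the repeated runs could be
timed separately, so the two numbers are never averaged together. The
names time_cold_and_warm, search and warm_runs are hypothetical and not
part of Xapian; search stands for whatever function runs your query.
The first run is only truly cold if the OS page cache was emptied
beforehand (on Linux, as root: sync; echo 3 > /proc/sys/vm/drop_caches).

```python
import time
from statistics import median


def time_cold_and_warm(search, query, warm_runs=5, clock=time.perf_counter):
    """Time the first run of `search(query)` apart from the repeated runs.

    `search` is a hypothetical callable supplied by the caller. The first
    run is only a cold-cache measurement if the page cache was dropped
    before calling this function; that step is not done here.
    """
    # First run: hits the disk if nothing is cached yet.
    start = clock()
    search(query)
    first = clock() - start

    # Repeated runs: the OS has most likely cached the blocks by now.
    warm = []
    for _ in range(warm_runs):
        start = clock()
        search(query)
        warm.append(clock() - start)

    return {"first": first, "warm_median": median(warm)}
```

Reporting the two values side by side shows how much of the apparent
speed comes from the cache rather than from the search engine itself.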

Users of all search engine platforms are surprised that some searches
take very long, especially those that are not in cache, because they
ran their performance tests against the cache and not the hard disk.
They quickly find their scalability problem and broken promises. In my
case, with 100-500GB of data on hard disk, the data cannot fit into
memory, and using two databases is two times slower than using a single
database. That is why I keep saying that the indexing performance of a
single database is the most important, because the search performance
follows.

__________________________________
  Kevin Duraj
  http://UncensoredWebSearch.com



> I would be surprised if there was a large overhead - there's a bit of
> extra work from opening the databases, and a small amount from having
> a "MultiPostList".  The combined size of the split databases is usually
> a little larger than the combined one, which may increase VM pressure a
> bit.
>
> If you do profile and find there's a significant difference, it would
> be interesting to see comparable profiles for the two cases to see where
> the extra time is spent.
>
> > maybe you can try to duplicate another copy of your db and search over them
> > together, it's very easy with just 1 extra line
> > db.add_database(xapian.Database("db"))
>
> You'd also need to generate the equivalent combined database (e.g. by
> using xapian-compact with the same input twice).
>
> But just duplicating the data isn't an accurate recreation of searching
> a real database split in two.  I don't know if it actually would
> make a difference, but it might.
>
>
> Cheers,
>    Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>


