[Xapian-discuss] Omega, flint, big stubs, big samples, missing dates, htdig_noindex

Olly Betts olly at survex.com
Fri Aug 25 16:53:31 BST 2006


On Tue, Aug 08, 2006 at 02:12:12PM +0100, James Aylett wrote:
> On Mon, Aug 07, 2006 at 09:48:53PM -0700, Jeff Breidenbach wrote:
> > Which is likely to be faster on a single machine, one gigantic
> > database or a few thousand smaller databases glued together
> > via a stub db?
> 
> It depends on things like how your disks are laid out, what your data
> looks like, and other concerns. How long does it take you to build
> your db? If it's under a day, I'd recommend doing both and testing
> each way.

For a smaller number of databases I'd probably agree, but I don't think
trying to search a few thousand databases together is really viable.
Even if you can open enough file handles, the time to open that many
databases is going to mount up.

The split databases will also use quite a lot more disk space, I'd imagine.

> > Is there a prayer of Omega storing a larger (e.g. configurable size)
> > sample someday so that one can get a better summary result? Or
> > is that idea doomed due to backwards compatibility issues?
> 
> Umm. Do we have an internal limit on the document data? I can quite
> happily add data several times the size of the underlying database
> block size, so I'm guessing no (effectively).

There's a limit, but it's pretty big.  For quartz the limit is somewhat
less than 16384 * blocksize, so with the default 8K block size, the
document data can be getting on for 128M.
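
As a rough check of that arithmetic, here is a small sketch. The names are
made up for illustration and are not Xapian identifiers, and the real limit
is somewhat below this upper bound:

```python
# Upper bound on quartz document data size, using the figures quoted above.
# BLOCKS_FACTOR and DEFAULT_BLOCK_SIZE are illustrative names only.
BLOCKS_FACTOR = 16384          # multiplier quoted above
DEFAULT_BLOCK_SIZE = 8 * 1024  # default 8K block size, in bytes


def approx_max_document_data(block_size=DEFAULT_BLOCK_SIZE):
    """Return the upper bound in bytes; the real limit is somewhat less."""
    return BLOCKS_FACTOR * block_size


limit = approx_max_document_data()
print(limit, "bytes =", limit // (1024 * 1024), "MiB")
```

With the default block size this gives 134217728 bytes, i.e. 128 MiB.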

> I think it's just that the number of words / characters / whatever is
> hard coded in omindex.cc, so it's pretty easy to change. Note that
> with scriptindex you have a configurable truncation limit.

Yeah, there's no good reason why the sample size stored by omindex can't
be configurable too.  Then we just need some code to pick out the best N
sentences from this based on where the matching terms are.
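
To make the "best N sentences" idea concrete, here is a minimal sketch. It
is not code from omindex or Omega: the function name, the naive sentence
splitting, and the scoring by raw word matches (no stemming) are all
assumptions chosen for illustration.

```python
import re


def best_sentences(sample, query_terms, n=2):
    """Pick the n sentences containing the most query-term occurrences."""
    terms = {t.lower() for t in query_terms}
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", sample.strip()) if s]
    scored = []
    for pos, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        hits = sum(1 for w in words if w in terms)
        scored.append((hits, pos, sentence))
    # Highest hit count first; earlier sentences win ties.
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:n]
    # Restore document order so the summary reads naturally.
    return [s for _, _, s in sorted(top, key=lambda x: x[1])]


sample = ("Xapian is a search library. "
          "Omega is a CGI search application built on Xapian. "
          "The weather is nice. "
          "Omega stores a sample of each document.")
for line in best_sentences(sample, ["omega", "sample"]):
    print(line)
```

A real implementation would presumably match on stemmed terms and might
prefer sentences where the matching terms sit close together.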

> > How can one successfully use END in an Omega query, but not
> > see document dates in the summary results or even the date
> > field at all in godmode?
> 
> Sorry, not sure I understand what you're asking...

I think Jeff's asking about date range filtering.  Omega does this by
generating a set of date terms for each document, and then building
a boolean expression from these to represent the desired range of
dates.  See the "Boolean terms" section here for details:

http://svn.xapian.org/trunk/xapian-applications/omega/docs/overview.txt?view=markup

Note that I don't think the "W" terms are currently used.
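
The following sketch shows how a date range can be turned into such a set
of terms. It is an illustration rather than Omega's actual code: it assumes
each document carries a day term (DYYYYMMDD), a month term (MYYYYMM) and a
year term (YYYYY), which is my reading of the linked document, and the
function name is hypothetical.

```python
import calendar
from datetime import date, timedelta


def date_range_terms(start, end):
    """Return terms whose OR covers every day from start to end inclusive."""
    terms = []
    day = start
    while day <= end:
        year_end = date(day.year, 12, 31)
        last_day = calendar.monthrange(day.year, day.month)[1]
        month_end = date(day.year, day.month, last_day)
        if day.month == 1 and day.day == 1 and year_end <= end:
            # A whole year fits inside the range: one Y term covers it.
            terms.append("Y%04d" % day.year)
            day = year_end + timedelta(days=1)
        elif day.day == 1 and month_end <= end:
            # A whole month fits inside the range: one M term covers it.
            terms.append("M%04d%02d" % (day.year, day.month))
            day = month_end + timedelta(days=1)
        else:
            # Partial month: fall back to one D term per day.
            terms.append("D%04d%02d%02d" % (day.year, day.month, day.day))
            day += timedelta(days=1)
    return terms


print(" OR ".join(date_range_terms(date(2005, 12, 30), date(2006, 2, 2))))
```

For the range 2005-12-30 to 2006-02-02 this prints
D20051230 OR D20051231 OR M200601 OR D20060201 OR D20060202, so a range
spanning several years needs far fewer terms than one term per day.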

Date filtering could be done with a MatchDecider object instead.
Omega's current approach predates MatchDeciders - it's how we used to
do date range filtering using Muscat 3.6!
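
For comparison, the decider approach amounts to a yes/no test applied to
each candidate document. The sketch below is plain Python and does not use
the Xapian bindings: the class name, the dict standing in for a document,
and the "date" key holding a YYYYMMDD string are all hypothetical.

```python
class DateRangeDecider:
    """Stand-in for a match decider: called once per candidate document."""

    def __init__(self, start, end):
        # Dates are YYYYMMDD strings, which compare correctly as text.
        self.start = start
        self.end = end

    def __call__(self, doc):
        # Assumes the indexer stored the date under doc["date"].
        stamp = doc.get("date")
        if stamp is None:
            return False  # undated documents never match a date range
        return self.start <= stamp <= self.end


docs = [
    {"id": 1, "date": "20060807"},
    {"id": 2, "date": "20060825"},
    {"id": 3},  # no date stored
]
decider = DateRangeDecider("20060801", "20060815")
print([d["id"] for d in docs if decider(d)])
```

The trade-off is that no date terms are needed in the index, but the test
has to run for every document the rest of the query matches.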

Cheers,
    Olly


