[Xapian-devel] Omega changes
James Aylett
james-xapian at tartarus.org
Fri Dec 17 17:04:21 GMT 2004
On Fri, Dec 17, 2004 at 04:47:14PM +0000, Olly Betts wrote:
> > If you get duplicates while doing that, you want to overwrite, [...]
> > Is there something else I'm missing?
>
> The issue isn't duplicates - it's that you don't want to REMOVE all the
> files from "/" when you index "/~olly/"...
Ah. That's not really how I intended ignore to be used. Lucky I
haven't tried to use it since you made it delete everything it doesn't
replace - I might have been upset with the results :-)
How often do people remove documents? If we make it delete only if
told to, then you skip that step and save some time. And memory,
actually - using a vector<bool> is fairly compact, but will start
becoming significant in a very large corpus over time. If I'm never
deleting documents I can save quite a bit of memory.
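
A minimal sketch of the opt-in idea, not Omega's actual code: the bitmap of seen document ids is only allocated when deletion has been requested. The names `UpdateTracker` and `delete_missing` are hypothetical, and the sketch assumes docids start at 1 and are all in use.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: remembers which existing docids were touched during
// an indexing run, but only if the caller asked for unseen documents to be
// deleted afterwards.
class UpdateTracker {
  public:
    // last_docid: highest docid in the database before this run (assumed
    // to be known up front).
    UpdateTracker(bool delete_missing, std::size_t last_docid)
        : delete_missing_(delete_missing),
          seen_(delete_missing ? last_docid + 1 : 0, false) {}

    // Record that docid was added or replaced in this run.
    void mark_seen(std::size_t docid) {
        if (delete_missing_ && docid < seen_.size()) seen_[docid] = true;
    }

    // Docids that existed before the run but were never marked.
    // Always empty when deletion was not requested.
    // Simplification: a real index would also skip docids not in use.
    std::vector<std::size_t> to_delete() const {
        std::vector<std::size_t> result;
        for (std::size_t id = 1; id < seen_.size(); ++id)
            if (!seen_[id]) result.push_back(id);
        return result;
    }

    // Approximate bitmap size in bytes, assuming the usual packed
    // one-bit-per-entry vector<bool>.
    std::size_t bitmap_bytes() const { return (seen_.size() + 7) / 8; }

  private:
    bool delete_missing_;
    std::vector<bool> seen_;
};
```

With deletion switched off, the tracker holds no bitmap at all, which is the memory saving described above.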
> > Yeek. I've just looked at the deletion code, and I'm not convinced it
> > works.
>
> Hmm, the mechanism is only meant to be used in the DUPE_replace case,
> but the code to remove the documents fires whatever dupes is set to.
> That's a bug. It also rather suggests nobody uses anything but
> the default of DUPE_replace!
:-)
> But your point may mean that an "assume still existing documents aren't
> modified" flag wouldn't be a win since we'd want the document id in
> this case anyway. Or that it should do exactly what DUPE_ignore is
> meant to, and only add unindexed documents.
I think that's more sane than trying to delete documents on ignore, to
be honest. But probably pulling these bits apart and being more
explicit about what happens would be a big win.
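
One way "pulling these bits apart" might look, as a sketch only: the duplicate policy decides what happens to each file, and deletion of unseen documents is a separate, explicit switch. All names here (`DupePolicy`, `IndexOptions`, `action_for`, `may_delete_unseen`) are made up for illustration and are not Omega's options.

```cpp
// Hypothetical names, not Omega's actual options.
enum class DupePolicy { Replace, Ignore };
enum class Action { Add, Replace, Skip };

struct IndexOptions {
    DupePolicy dupes = DupePolicy::Replace;
    bool delete_missing = false;  // deletion is its own explicit switch
};

// What to do with one file, given whether its URL is already indexed.
inline Action action_for(const IndexOptions& opts, bool already_indexed) {
    if (!already_indexed) return Action::Add;
    return opts.dupes == DupePolicy::Replace ? Action::Replace : Action::Skip;
}

// Whether the end-of-run sweep that removes unseen documents may run.
// It requires both the explicit flag and the Replace policy, so running
// with Ignore never deletes anything.
inline bool may_delete_unseen(const IndexOptions& opts) {
    return opts.delete_missing && opts.dupes == DupePolicy::Replace;
}
```

Keeping the sweep behind its own predicate means the bug described above, where removal fires whatever dupes is set to, has one obvious place to be checked.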
> > If we want to remove --duplicates=ignore, we'll have to profile that
> > against different OS/FS combinations. While --duplicates=ignore will
> > certainly be faster, I can imagine quite a variance of stat/compare on
> > different systems. Probably better to leave it in somehow, probably by
> > replacing --duplicates with --ignore-duplicates.
>
> Actually, I suspect the stat time will be swamped by the time to find
> the modtime from Xapian. Although perhaps not if lots is cached.
I'm expecting lots of the database to be successfully cached. Were you
thinking of putting this in a value, or in the document data? The
former might cache better in this context, but that isn't a terribly
good argument for putting it there.
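
Wherever the modtime ends up being stored, the check itself is small. A sketch, assuming the modtime is kept as a decimal string (the encoding and the function names are assumptions, not anything Omega does today):

```cpp
#include <ctime>
#include <string>

// Encode a modification time as a decimal string, e.g. for storing in a
// document value or inside the document data (hypothetical encoding).
inline std::string encode_mtime(std::time_t t) {
    return std::to_string(static_cast<long long>(t));
}

inline std::time_t decode_mtime(const std::string& s) {
    return static_cast<std::time_t>(std::stoll(s));
}

// stored: what the index returned for this file; empty if not indexed yet.
// file_mtime: st_mtime as reported by stat() on the file.
inline bool needs_reindex(const std::string& stored, std::time_t file_mtime) {
    if (stored.empty()) return true;            // never indexed
    return file_mtime != decode_mtime(stored);  // differs from indexed copy
}
```

The comparison uses "differs" rather than "newer" so that a file restored with an older timestamp is still picked up; that is a design choice of the sketch, not something settled in this thread.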
> > > > is there anything sane we can do with different stemmers in use?
> >
> > You could do it in the query, in simple cases, couldn't you? Or would
> > it mess up ranks in the MSet?
>
> So replace TERM1 with:
>
> (en_stem(TERM1) OR fr_stem(TERM1) OR de_stem(TERM1))
>
> The problem there is that stems may collide, and a French word could
> match a totally unrelated English word which happens to have the same
> stem. I don't know if this is a big problem in reality.
Well, my feeling is that this is a last-ditch attempt to support users
who aren't being explicit about what they want. I'd be further inclined
to do it only on short queries.
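
A rough sketch of the expansion with the short-query cut-off, using plain strings instead of Xapian queries. The `Stemmer` type, `expand_query` and `max_terms` are hypothetical; real stemmers would be plugged in by the caller. Leaving longer queries untouched is an assumption of the sketch.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Any callable mapping a word to its stem in one language.
using Stemmer = std::function<std::string(const std::string&)>;

// For each query term, return the set of alternatives to OR together.
// Queries longer than max_terms are left as they are (assumed fallback).
// Using a set means stems that collide across languages appear only once,
// which is exactly where the cross-language false matches come from.
std::vector<std::set<std::string>> expand_query(
    const std::vector<std::string>& terms,
    const std::vector<Stemmer>& stemmers,
    std::size_t max_terms) {
    const bool expand = terms.size() <= max_terms;
    std::vector<std::set<std::string>> groups;
    for (const std::string& term : terms) {
        std::set<std::string> alternatives;
        if (expand) {
            for (const Stemmer& stem : stemmers)
                alternatives.insert(stem(term));
        }
        // No expansion (or no stemmers): keep the term as typed.
        if (alternatives.empty()) alternatives.insert(term);
        groups.push_back(alternatives);
    }
    return groups;
}
```

Each inner set corresponds to one "(en_stem(TERM1) OR fr_stem(TERM1) OR ...)" group from the quoted suggestion.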
> You could add a boolean term to each database which specifies the
> language and indexes every single document.
I was kind of assuming that, actually. :)
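
To make that assumption concrete, here is a sketch that pairs each stem with its language's boolean term, so an English stem can only match documents tagged as English. The output is just a readable string; the `FILTER` notation and the `L` prefix for the language term are assumptions for illustration, as are the `Language` struct and `expand_term`.

```cpp
#include <functional>
#include <string>
#include <vector>

// One supported language: its code and a stemmer supplied by the caller.
struct Language {
    std::string code;  // e.g. "en"
    std::function<std::string(const std::string&)> stem;
};

// Build "(stem FILTER Lcode) OR (stem FILTER Lcode) ..." for one term.
// The boolean language term restricts each stem to documents in that
// language, which avoids matches caused by colliding stems.
std::string expand_term(const std::string& term,
                        const std::vector<Language>& langs) {
    std::string out;
    for (const Language& lang : langs) {
        if (!out.empty()) out += " OR ";
        out += "(" + lang.stem(term) + " FILTER L" + lang.code + ")";
    }
    return out;
}
```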
> > But users won't bother to tell you which language they're searching
> > in. I suppose you could try to autodetect that, but it won't work well
> > with typical short queries, I suspect.
>
> I think really you need to make them bother, rather than trying to
> cobble together some Heath Robinson scheme which slows down every
> search to allow them not to.
That's why I don't think you should hit every search with it. If they
tell you the language, do it properly, otherwise try this in an effort
to get something useful for them. Or you could just auto-detect the
language from their browser, and give them the option to change it. If
it's prominent, that's not a bad solution at all.
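
A sketch of the browser-based default, assuming the CGI hands us the raw Accept-Language header. `pick_language` and its parameters are hypothetical. As a simplification it takes entries in header order and ignores q-values, which a fuller implementation would sort by.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Return the first language in an Accept-Language header that we have a
// stemmer for, or fallback if none matches.
std::string pick_language(const std::string& header,
                          const std::vector<std::string>& supported,
                          const std::string& fallback) {
    std::istringstream in(header);
    std::string item;
    while (std::getline(in, item, ',')) {
        // Drop any ";q=..." parameter.
        item = item.substr(0, item.find(';'));
        // Keep only the primary subtag: "en-GB" becomes "en".
        item = item.substr(0, item.find('-'));
        // Strip whitespace and lower-case what is left.
        std::string code;
        for (unsigned char c : item)
            if (!std::isspace(c)) code += static_cast<char>(std::tolower(c));
        if (std::find(supported.begin(), supported.end(), code) !=
            supported.end())
            return code;
    }
    return fallback;
}
```

The result would only be a default: the search form can still show it prominently and let the user change it.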
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org