[Xapian-discuss] add_posting(): term position significance - line or offset?
Olly Betts
olly at survex.com
Wed Nov 19 11:58:02 GMT 2008
On Tue, Nov 18, 2008 at 06:41:10PM +0000, Richard Boulton wrote:
> It could be implemented simply by using the existing database structures
> (ie, the stored term postings). However, the performance of such
> searches might not be terribly good - currently the performance of
> phrase searches is significantly worse than that of non-phrase searches,
> because the positional information is stored separately, so requires
> extra disk accesses.
I don't think it's as simple as "because it is stored separately".
Whatever you do, you need to read more data (the positional
information), regardless of where it is stored. You can store it
alongside the posting, but then reading the posting won't tend to pull
in the next chunk of postings as much as it currently does.
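
As a rough illustration of that trade-off, here is a back-of-envelope sketch in Python. The block size and per-entry byte counts are made-up assumptions chosen for easy arithmetic, not measurements of any real Xapian backend, and postings_per_block is a hypothetical helper.

```python
def postings_per_block(block_bytes: int, posting_bytes: int,
                       position_bytes: int = 0) -> int:
    """How many postings fit in one disk block under a fixed-size toy model.

    position_bytes is the average positional data stored inline with each
    posting; 0 models keeping the positions in a separate structure.
    """
    return block_bytes // (posting_bytes + position_bytes)


# Assumed numbers, for illustration only.
BLOCK = 8192          # bytes per block
POSTING = 4           # bytes per posting entry
POSITIONS = 12        # average bytes of positional data per posting

separate = postings_per_block(BLOCK, POSTING)
inline = postings_per_block(BLOCK, POSTING, POSITIONS)
print(separate, inline)
```

Under these assumptions, one block read brings in a quarter as many postings when the positions are stored inline, which is the sense in which a read pulls in less of the next chunk of postings.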
There may be additional locality issues with pulling it in from a
different file. As far as I know, nobody has tried to quantify those
yet.
> If we moved to a situation where we were accessing the positional
> information for pretty much every search term, we would probably be
> better off moving the positional information into the main posting
> lists. However, this would slow down non-positional searches.
The positional information won't be relevant to a single term query
(since you need two or more terms to make it useful to look at the
relative positions). So it's overhead for that case.
For a multi-term AND-query at least, you still don't want the positional
information for every posting, only those for documents where the other
terms occur. Ditto for NEAR and PHRASE since they are essentially
AND-plus-positional-filter.
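
To make the "AND-plus-positional-filter" point concrete, here is a minimal Python sketch over a toy in-memory index. The index layout (term -> doc id -> positions) and the function names and_query and phrase_query are assumptions for illustration only; this is not how Xapian stores or evaluates anything internally.

```python
from typing import Dict, List

# Toy index: term -> {doc_id: sorted positions}. Hypothetical layout.
Index = Dict[str, Dict[int, List[int]]]


def and_query(index: Index, terms: List[str]) -> List[int]:
    """Doc ids containing every term. Only doc ids are examined here,
    never the position lists."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase_query(index: Index, terms: List[str]) -> List[int]:
    """Doc ids where the terms occur at consecutive positions, in order.

    Positions are looked up only for documents that survive the AND step.
    """
    matches = []
    for doc in and_query(index, terms):
        first = index[terms[0]][doc]
        rest = [set(index[t][doc]) for t in terms[1:]]
        # The phrase matches if some start position p has term i+1 at p+i+1.
        if any(all(p + i + 1 in s for i, s in enumerate(rest)) for p in first):
            matches.append(doc)
    return matches


index: Index = {
    "term": {1: [0, 5], 2: [3], 3: [1]},
    "position": {1: [1], 2: [7], 4: [0]},
}
print(and_query(index, ["term", "position"]))
print(phrase_query(index, ["term", "position"]))
```

In this toy data, documents 3 and 4 each contain only one of the two terms, so their position lists are never consulted. A single-term query never needs positions at all.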
What you say is true for a multi-term OR-query, but only until it decays
into a different operator.
Cheers,
Olly