[Xapian-discuss] [ NUMBER OF SAMPLE ]

Olly Betts olly at survex.com
Tue Aug 10 15:47:25 BST 2004


Boris Meyer wrote:
> The solution could be the retrieving of the words/phrases offset in 
> the document and the extraction from this offset with a fork (x char 
> before/x after) in combination with a document local weight algorithm 
> if more than one match in the same document.

It would be fairly easy to implement an adaptor class so that the
positionlist for a term in a document could be used as a posting list.
Then the Xapian matcher can be used.
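A minimal sketch of what such an adaptor might look like, in illustrative Python rather than Xapian's actual C++ API (the class name and the idea of using term positions as pseudo-docids are assumptions for illustration; the `next`/`skip_to`/`get_docid` shape mirrors how Xapian posting lists are iterated):

```python
class PositionListAsPostingList:
    """Present one term's position list (its positions within a single
    document) as if it were a posting list, so each term position can
    stand in for a document id and the ordinary matcher can rank it."""

    def __init__(self, positions):
        self._positions = sorted(positions)
        self._i = 0

    def at_end(self):
        return self._i >= len(self._positions)

    def get_docid(self):
        # The term position plays the role of the docid.
        return self._positions[self._i]

    def next(self):
        self._i += 1

    def skip_to(self, docid):
        # Advance to the first pseudo-docid >= docid, as a posting
        # list iterator would during an AND-style match.
        while not self.at_end() and self.get_docid() < docid:
            self._i += 1


# Positions of some term in one document:
pl = PositionListAsPostingList([17, 3, 42])
seen = []
while not pl.at_end():
    seen.append(pl.get_docid())
    pl.next()
# seen is now [3, 17, 42]
```

With several such adaptors (one per query term), the normal AND/OR matching machinery can then be pointed at positions instead of documents.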

Ideally you probably want to pull up the sentences that match best, but
then you have a problem mapping term positions into "sentence ids".  A
reasonable approximation might be for the adaptor class to return
matches for (say) 10 words before and after each position.

So then run a second mini-match for each hit you want to produce a
dynamic sample for, which would return a ranked list of term positions.
Take the top few and pull out the sentences that they're in, stringing
them together in order to give the sample.  This requires the raw text
of the document, but you can just store that in the Xapian database,
along with a way to map from a term position to the text of a sentence
within the raw text containing that term position.
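A sketch of that last step, assuming the stored mapping is a sorted list of the term position at which each sentence starts (the names `sentence_starts` and `dynamic_sample` are invented for illustration; this is not an existing Xapian API):

```python
import bisect

def sentence_for_position(sentence_starts, pos):
    """sentence_starts: sorted term positions at which each sentence
    begins.  Binary chop to the sentence containing pos."""
    return bisect.bisect_right(sentence_starts, pos) - 1

def dynamic_sample(sentences, sentence_starts, ranked_positions, n=2):
    """Take the top-n ranked term positions from the mini-match, find
    the sentences they fall in, and string those together in document
    order to form the sample."""
    ids = {sentence_for_position(sentence_starts, p)
           for p in ranked_positions[:n]}
    return " ... ".join(sentences[i] for i in sorted(ids))

sentences = ["The quick brown fox.",
             "It jumped over the dog.",
             "Then it ran away."]
# First words of the sentences are at term positions 1, 5 and 10.
sentence_starts = [1, 5, 10]
sample = dynamic_sample(sentences, sentence_starts, [11, 4])
# sample == "The quick brown fox. ... Then it ran away."
```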

Hmm, we could use a sorted list giving the byte offset of the start of
each sentence (so you can binary chop to map the term position to the
byte offsets of the start and end of the sentence containing that term).
But it would probably be better to use this sorted list to map from term
position to sentence id - after all, we can just walk the list as we
walk the positionlists, and it substantially reduces the number of ids
the adaptor class needs to return and the matcher needs to handle.  The
list would also compress well then (if we don't need to binary chop it
we can compress it like we compress a posting list).
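The two halves of that idea can be sketched together: walking the sorted sentence-start list in step with the positions (so no binary chop is needed), and storing the list as deltas the way a posting list is (illustrative Python; real posting-list compression would also variable-byte or similarly encode the gaps):

```python
def positions_to_sentence_ids(sentence_starts, positions):
    """Walk the sorted sentence-start list alongside the sorted term
    positions, yielding the sentence id for each position.  One linear
    pass over both lists, no binary chop."""
    sid = 0
    out = []
    for p in sorted(positions):
        while (sid + 1 < len(sentence_starts)
               and sentence_starts[sid + 1] <= p):
            sid += 1
        out.append(sid)
    return out

def delta_encode(sorted_list):
    """Store the first value then the gaps, as a posting list would;
    small gaps then compress well."""
    prev, out = 0, []
    for v in sorted_list:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out
```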

I rather like that actually.  I've had this scheme in mind in a slowly
evolving form for a while, but it somehow hasn't felt right yet.  But
now it is starting to...

> I'm diving into the Api, looking for some methods to retrieve this offset.

The best way to do this without adding more API is probably to store the
whole raw document as the sample, then tweak omega's highlighting to
discard sentences with no matching terms, and terminate after a certain
amount of sample has been written.  That will pick the first sentences
which match, rather than those which match best - it'll probably work
pretty well in practice though.
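A rough sketch of that "first sentences which match" behaviour (plain illustrative Python, not omega's actual highlighting code; the sentence splitting and term normalisation here are deliberately crude):

```python
import re

def first_match_sample(text, terms, max_len=200):
    """Keep only sentences containing at least one query term, and stop
    once roughly max_len characters of sample have been emitted.  Picks
    the first matching sentences, not the best-matching ones."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    out, total = [], 0
    for s in sentences:
        words = {w.lower().strip('.,!?') for w in s.split()}
        if words & set(terms):
            out.append(s)
            total += len(s)
            if total >= max_len:
                break
    return " ... ".join(out)

text = ("The quick brown fox. It jumped over the lazy dog. "
        "Then it ran away. Nothing else happened.")
sample = first_match_sample(text, {"fox", "ran"})
# sample == "The quick brown fox. ... Then it ran away."
```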

> As HDs are now low-cost and as everybody today is looking for a 
> Google-style meaningful result listing with highlighted terms, I would 
> also store such an index. But maybe there is another way?

Disk space may be cheap, but connecting it to a computer such that you
can get at large amounts of data quickly pushes the cost up.

The gap between the I/O speed of disks and the speed of processors and
memory is actually widening, and a large search system will end up I/O
bound, so you need to think carefully about the amount of disk I/O
you're going to cause...

So reusing the term position information we already store for phrase
and proximity searching is a good plan.

Cheers,
    Olly


