[Xapian-discuss] What kind of data in the datafield

Jim k4gvo at bellsouth.net
Fri Jan 5 09:13:57 GMT 2007


Felix Antonius Wilhelm Ostmann wrote:
> this is the way i think. the user can see a short text (400 byte) with
> the terms in the query.

Since you can't anticipate in advance which terms are going to be
searched for, you can't store a short text containing those terms, as
you suggest.  There are a couple of ways to handle this.  As someone
pointed out earlier, you could store the entire document, but that's
pretty expensive, IMHO.

I've elected to parse the files of interest after Xapian has returned
the hit list, find the search terms in each document and include a few
words before and after the terms, sort of like Google displays them. 
There is a little overhead, but it's not objectionable.  That said, my
documents are not extremely large.
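
A minimal sketch of that post-processing step, in Python.  The function
name make_snippets and the context parameter are made up for
illustration, and the matching is plain case-insensitive word
comparison; if your index stems its terms, you would need to stem the
document words the same way before comparing.

```python
import re

def make_snippets(text, terms, context=5):
    """Return short excerpts of `text` around occurrences of the search terms.

    Hypothetical helper: splits on non-word characters and compares
    lowercased words, with no stemming.
    """
    words = re.findall(r"\w+", text)
    wanted = {t.lower() for t in terms}
    snippets = []
    last_end = -1  # index of the last word already shown in a snippet
    for i, word in enumerate(words):
        # Skip hits that the previous snippet already covers.
        if word.lower() in wanted and i > last_end:
            start = max(0, i - context)
            end = min(len(words), i + context + 1)
            snippets.append(" ".join(words[start:end]))
            last_end = end - 1
    return snippets
```

You would call this once per hit, after reading the file that the hit
refers to, for example make_snippets(open(path).read(), ["lucky"]).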

This scheme wouldn't work well with large documents, unless you kept the
documents already parsed somewhere.

Here's a novel idea I just had.  Parse each (hit) document into words,
eliminating special characters, as close to the way QueryParser does it
as possible.  Store each word in a row in a memory-resident sqlite
database along with its position in the file.  Create indexes on both
the term and the position.  For each search term, run something like
"select pos from mydb where term='lucky';" and store each of those
positions in a list, vector or array, depending on the language of
choice.  Then, for each stored position N, you can "select term from
mydb where pos > N-5 and pos < N+5;" or whatever the equivalent sql is
in sqlite.  This will give you a set of phrases you can use to display
as the sample.
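
Roughly, using Python's built-in sqlite3 module.  This is only a sketch
of the idea: the helper names build_word_db and phrases_for are
invented here, the word splitting is a simple regex rather than
whatever QueryParser really does, and the window query uses BETWEEN, so
it includes both end positions.

```python
import re
import sqlite3

def build_word_db(text):
    """Load the words of one hit document into an in-memory sqlite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mydb (term TEXT, pos INTEGER)")
    # Assumption: lowercase and split on non-word characters.
    words = re.findall(r"\w+", text.lower())
    conn.executemany(
        "INSERT INTO mydb (term, pos) VALUES (?, ?)",
        ((w, i) for i, w in enumerate(words)),
    )
    conn.execute("CREATE INDEX idx_term ON mydb (term)")
    conn.execute("CREATE INDEX idx_pos ON mydb (pos)")
    return conn

def phrases_for(conn, term, window=5):
    """Return one phrase per occurrence of `term`, `window` words either side."""
    hits = [row[0] for row in conn.execute(
        "SELECT pos FROM mydb WHERE term = ? ORDER BY pos", (term.lower(),))]
    phrases = []
    for n in hits:
        rows = conn.execute(
            "SELECT term FROM mydb WHERE pos BETWEEN ? AND ? ORDER BY pos",
            (n - window, n + window),
        )
        phrases.append(" ".join(r[0] for r in rows))
    return phrases
```

Since the table lives in memory, it disappears when the connection is
closed, so you would build one per hit document and throw it away once
the sample text has been generated.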

Just thinking out loud.

Cheers,
Jim.


