[Xapian-discuss] [ NUMBER OF SAMPLE ]
Boris Meyer
boris.meyer at rom.fr
Thu Jul 22 10:17:50 BST 2004
Hello Eric,
Eric B. Ridge wrote:
> On Jul 21, 2004, at 12:59 PM, Boris Meyer wrote:
>
>> I'm diving into the API, looking for some methods to retrieve this
>> offset.
>
> It ain't there! :) The best you can do is get the "positional" data,
> which I'm willing to bet is "word position" with Omega.
Right.
>>> Right now one must re-parse the document, joining up with the terms
>>> list from the result to find and highlight any/all hits, let alone
>>> context extraction. A fairly expensive operation if you're going to
>>> do this on a "summary display" of many documents.
>>
>> Yes, a very expensive process, especially since the average size of the
>> documents I would have to parse is 3 MB (PDF), and don't forget the
>> 10 results per page, please ;-).
>
> PDF text extraction is a pain in the ass. I've got a handwritten PDF
> parser (in Java) that does a decent job of text extraction (better than
> xpdf in raw mode, in my opinion), but it's not perfect by any means.
The best way to parse PDF is "object" access (meaning block by block).
> And this is another gotcha. Even if Xapian did support tracking byte
> offsets of terms, for what you want to do the offsets would need to be
> offsets in the text version of the PDF, not the PDF itself. And where
> is the text version of the PDF stored?
Right... I was thinking I could possibly use the Xapian API to index,
store and parse queries, but once the result MSet is obtained, iterating
over each result PDF file with a sort of 'pdftotext | grep word' is very
heavy.
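To make the cost concrete, here is a minimal Python sketch of that 'pdftotext | grep word' step. It is only an illustration and has nothing to do with the Xapian API: the function names pdf_to_text and snippets are invented for the example, and it assumes the pdftotext command-line tool is installed and on the PATH. The expensive part is that pdf_to_text has to run once for every PDF on the result page.

```python
import re
import subprocess


def pdf_to_text(path):
    """Return the text of a PDF by running the external `pdftotext` tool.

    Assumption: `pdftotext` is on the PATH. The "-" argument asks it to
    write the extracted text to standard output instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def snippets(text, word, width=30):
    """Yield a short context around each whole-word, case-insensitive hit."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        before = text[start:match.start()]
        after = text[match.end():end]
        # Mark the hit with asterisks as a plain-text form of highlighting.
        yield before + "*" + match.group(0) + "*" + after
```

For a result page one would call snippets(pdf_to_text(path), word) for each of the 10 files, which means 10 full text extractions of roughly 3 MB each per page view.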
>> As hard disks are now cheap, and as everybody today expects a
>> Google-like, meaningful result listing with highlighted terms, I would
>> also store such an index. But maybe there is another way?
>
> I don't know what Omega will let you do, but using Xapian's API, when
> you add a term to a document you can optionally give it positional
> information. The intent I'm sure is for the position to be word
> position. Xapian uses this for proximity searching. You could instead
> use byte offsets, but your options for proximity go away. There's
> little meaning in "documents where 'foo' and 'bar' are within XX bytes
> of each other".
I'll look into this option, many thanks for your help.
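As a note to myself, here is a rough Python sketch of the trade-off Eric describes. It is a plain in-memory model, not the Xapian API: the names index_text, within and context are invented for the example, and the dictionaries only stand in for what I understand Xapian would hold as positional data if positions are passed to Document::add_posting(). It keeps word positions and byte spans side by side, to show that word positions are what make a proximity test meaningful, while byte spans are what let you cut out a highlighted context without re-parsing. It also assumes that the text version of the document is stored somewhere, which is exactly the open question above.

```python
import re
from collections import defaultdict


def index_text(text):
    """Return (word_positions, byte_spans) for every term in `text`.

    word_positions maps term -> list of 1-based word positions.
    byte_spans maps term -> list of (start, end) offsets into the
    UTF-8 encoding of `text`.
    """
    word_positions = defaultdict(list)
    byte_spans = defaultdict(list)
    char_pos = 0   # character index already accounted for
    byte_pos = 0   # byte offset corresponding to char_pos
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        # Advance the byte counter over the text skipped since last time.
        byte_pos += len(text[char_pos:match.start()].encode("utf-8"))
        char_pos = match.start()
        term = match.group(0).lower()
        length = len(match.group(0).encode("utf-8"))
        word_positions[term].append(position)
        byte_spans[term].append((byte_pos, byte_pos + length))
    return word_positions, byte_spans


def within(word_positions, a, b, distance):
    """True if terms a and b occur within `distance` words of each other."""
    return any(
        abs(pa - pb) <= distance
        for pa in word_positions.get(a, [])
        for pb in word_positions.get(b, [])
    )


def context(data, span, width=20):
    """Cut a highlighted context out of stored UTF-8 bytes, with no re-parse."""
    start, end = span
    before = data[max(0, start - width):start].decode("utf-8", "ignore")
    after = data[end:end + width].decode("utf-8", "ignore")
    return before + "*" + data[start:end].decode("utf-8") + "*" + after
```

With only byte offsets stored as positions, within() would be comparing byte distances, which is the "within XX bytes of each other" problem; with only word positions, context() has nothing to slice with. Storing both costs disk space, which is the part I said I could live with.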
> eric
--
Regards, Boris.
+---------------------------+----------------------+
| Boris Meyer | Tel : 04 93 92 88 88 |
| Administration / Internet | Fax : 04 93 92 18 93 |
| Developpement | Web : http://rom.fr |
+---------------------------+----------------------+
| 19, bd Carabacel | - - - - - x - - - - |
| 06000 Nice | - - - - - x - - - - |
+---------------------------+----------------------+
| boris.meyer at rom.fr | http://www.rom.fr |
+---------------------------+----------------------+