[Xapian-discuss] [ NUMBER OF SAMPLE ]

Thu Jul 22 10:17:50 BST 2004

Hello Eric,

Eric B. Ridge wrote:

> On Jul 21, 2004, at 12:59 PM, Boris Meyer wrote:
> 
>> I'm diving into the API, looking for some methods to retrieve this 
>> offset.
> 
> It ain't there!  :)  The best you can do is get the "positional" data, 
> which I'm willing to bet is "word position" with Omega.

Right.

>>> Right now one must re-parse the document, joining up with the terms 
>>> list from the result to find and highlight any/all hits, let alone 
>>> context extraction.  A fairly expensive operation if you're going to 
>>> do this on a "summary display" of many documents.
>>
>> Yes, a very time-consuming process, especially given the average size 
>> of the documents I would have to parse, 3 MB (PDF), and don't forget 
>> the x10 results/page please ;-).
> 
> PDF text extraction is a pain in the ass.  I've got a handwritten PDF 
> parser (in Java) that does a decent job of text extraction (better than 
> xpdf in raw mode, in my opinion), but it's not perfect by any means.

The best way to parse PDF is "object" access (meaning block by block).

> And this is another gotcha.  Even if Xapian did support tracking byte 
> offsets of terms, for what you want to do the offsets would need to be 
> offsets in the text version of the PDF, not the PDF itself.  And where 
> is the text version of the PDF stored?

Right... I was thinking of eventually using the Xapian API to index, 
store and parse queries, and then, once the result MSet is obtained, 
iterating over each result PDF file with something like 
'pdftotext | grep word'. Very heavy.
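
To make the cost concrete, here is a minimal sketch of the per-result step, assuming the PDF has already been converted to plain text (for example by pdftotext) and is available as a string. The function name `snippets` and the `width` parameter are made up for illustration; nothing here comes from Xapian or Omega.

```python
import re

def snippets(text, word, width=30):
    """Return short context windows around each whole-word match of `word`.

    Hypothetical helper: `text` is the already-extracted text of one document.
    The match is wrapped in [brackets] to stand in for highlighting.
    """
    out = []
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        context = text[start:m.start()] + "[" + m.group(0) + "]" + text[m.end():end]
        # Collapse runs of whitespace so line breaks from extraction do not show.
        out.append(" ".join(context.split()))
    return out
```

This scans the whole extracted text once per query word and per result, which is why doing it for ten 3 MB documents on every results page gets heavy.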

>> As hard disks are now cheap and everybody today expects a Google-like, 
>> meaningful result listing with highlighted terms, I would also store 
>> such an index. But maybe there is another way?
> 
> I don't know what Omega will let you do, but using Xapian's API, when 
> you add a term to a document you can optionally give it positional 
> information. The intent I'm sure is for the position to be word 
> position. Xapian uses this for proximity searching. You could instead 
> use byte offsets, but your options for proximity go away. There's 
> little meaning in "documents where 'foo' and 'bar' are within XX bytes 
> of each other".

I'll look into this option, many thanks for your help.
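
To illustrate the trade-off Eric describes, here is a rough sketch in plain Python (no Xapian calls) that records both a word position and a UTF-8 byte offset for every term of the extracted text. The names `index_terms` and `within` are hypothetical, and the tokenizer is a naive `\w+` split, not what Omega does.

```python
import re

def index_terms(text):
    """Return (term, word_position, byte_offset) for each word in `text`.

    Word positions start at 1; byte offsets refer to the UTF-8 encoding
    of the extracted text, not to the original PDF file.
    """
    entries = []
    byte_offset = 0
    last = 0
    for pos, m in enumerate(re.finditer(r"\w+", text), start=1):
        # Advance the byte counter by the encoded size of the skipped text.
        byte_offset += len(text[last:m.start()].encode("utf-8"))
        last = m.start()
        entries.append((m.group(0).lower(), pos, byte_offset))
    return entries

def within(entries, a, b, window):
    """True if terms a and b occur within `window` word positions of each other."""
    pos_a = [p for t, p, _ in entries if t == a]
    pos_b = [p for t, p, _ in entries if t == b]
    return any(abs(x - y) <= window for x in pos_a for y in pos_b)
```

With Xapian itself, the word position is the value one would pass as the position argument of Document::add_posting(), so proximity search keeps working. The byte offsets would then have to live somewhere else, for instance in a side table keyed by document and term; that part is my assumption, not an existing Xapian feature.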

> eric

-- 
Regards, Boris.
+---------------------------+----------------------+
| Boris Meyer               | Tel : 04 93 92 88 88 |
| Administration / Internet | Fax : 04 93 92 18 93 |
| Developpement             | Web : http://rom.fr  |
+---------------------------+----------------------+
| 19, bd Carabacel          | - - - - - x - - - -  |
| 06000 Nice                | - - - - - x - - - -  |
+---------------------------+----------------------+
| boris.meyer at rom.fr     | http://www.rom.fr    |
+---------------------------+----------------------+