[Xapian-discuss] Indexing PDF, DOC etc.

Thu Nov 6 12:05:23 GMT 2008

Florian Beer wrote:
> I'm trying to index PDFs that are stored in a MySQL database (blob
> field) using omindex now.
> What's the exact call to tell omindex to index a byte stream (passed
> directly from my Python program) instead of specifying a directory on
> the command line?
>
> Is this even possible, or would I have to first write the PDF data out
> from MySQL to a temporary file, have omindex index it (supplying
> arbitrary metadata) and then delete the temp file?
>
From the man page:

omindex - Index static website data via the filesystem

Omindex reads a directory hierarchy of files which represent the data
accessible via a browser.  It works from files on disk, so it's not the
tool you want for indexing PDF files held in a MySQL database.

Scriptindex may be something that you could use.  It processes a file at
a time.  The other option is to use the Python Xapian package to
programmatically generate an index.
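
A rough sketch of that second option, in case it helps.  It assumes the
Xapian Python bindings and the external pdftotext program are installed.
The function names, the "Q" and "S" term prefixes (borrowed from Omega's
conventions) and the choice to keep only the title in the document data
are my own picks for illustration, not anything omindex does for you.
pdftotext reads from a file, so the blob still passes through a
short-lived temp file, but only for text extraction.

```python
import subprocess
import tempfile


def pdf_bytes_to_text(pdf_bytes):
    """Extract plain text from PDF bytes using the external pdftotext tool."""
    # pdftotext wants a file name, so write the blob to a temp file first.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        result = subprocess.run(
            ["pdftotext", "-enc", "UTF-8", tmp.name, "-"],  # "-" = stdout
            check=True,
            capture_output=True,
        )
    return result.stdout.decode("utf-8", errors="replace")


def unique_term(row_id):
    """Build a unique ID term from the MySQL row ID (assumed "Q" prefix)."""
    return "Q" + str(row_id)


def index_pdf(db, row_id, title, text):
    """Add or replace one document in a xapian.WritableDatabase."""
    import xapian  # imported here so the helpers above work without it

    doc = xapian.Document()
    doc.set_data(title)  # what you get back from a match; adjust as needed

    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))
    indexer.set_document(doc)
    indexer.index_text(title, 1, "S")  # title terms under an assumed prefix
    indexer.increase_termpos()         # keep phrases from spanning fields
    indexer.index_text(text)           # body text, unprefixed

    # The ID term lets a re-run replace the old copy instead of duplicating
    # it, and lets search results be mapped back to the MySQL row.
    idterm = unique_term(row_id)
    doc.add_term(idterm)
    db.replace_document(idterm, doc)


# Hypothetical usage, where "cursor" yields (id, title, blob) rows:
#   db = xapian.WritableDatabase("pdfindex", xapian.DB_CREATE_OR_OPEN)
#   for row_id, title, blob in cursor:
#       index_pdf(db, row_id, title, pdf_bytes_to_text(blob))
```

Because each document carries its row ID as a term, a search front end
can pull the ID from a match and fetch the original PDF from MySQL.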

Just curious: once you have the index and are searching, what mechanism
are you using to retrieve the documents?  For example, do you have a web
page that takes a document ID and then retrieves the data from the
database?

Jim.