MediaWiki to Omega

aldennisa15 xapian5485 at aldennis.me.uk
Sat Feb 24 09:59:29 GMT 2018


I use Omega to index and search an archive of magazine and ebook PDFs, 
among other things. I also have a wiki (in MediaWiki) that I wanted to 
include in that index.

If it's any use to anybody, I've adapted dbi2omega to export the pages 
from MediaWiki and shared it on GitHub - search for mediawiki2omega.

It doesn't do anything very clever, but it might save someone time 
figuring out the MediaWiki database and the scriptindex fields. Feel 
free to correct me if I've not understood the Xapian fields properly!
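
For anyone who hasn't seen the scriptindex input format, here is a 
minimal sketch in Python of writing one record: one field=value line 
per field, continuation lines starting with "=", and a blank line to 
end the record. The function name and the field names (title, url, 
text) are made up for illustration and are not necessarily the ones 
mediawiki2omega uses.

```python
def format_record(fields):
    """Format one record as scriptindex input text.

    `fields` maps a field name to its string value. The field names
    used by callers here are hypothetical examples only.
    """
    out = []
    for name, value in fields.items():
        first, *rest = value.split("\n")
        out.append(f"{name}={first}")
        # Further lines of a multi-line value are continuation lines,
        # which start with "=".
        out.extend(f"={line}" for line in rest)
    # A blank line terminates the record.
    return "\n".join(out) + "\n\n"


if __name__ == "__main__":
    page = {
        "title": "Main Page",
        "url": "/wiki/Main_Page",
        "text": "Welcome to the wiki.\nSecond line of wiki text.",
    }
    print(format_record(page), end="")
```

A converter would call something like this once per wiki page and write 
the results one after another to the file (or pipe) that scriptindex 
reads.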

I'm sure it could be improved, for example by doing something with 
categories, exporting the Talk:, User: etc. namespaces, or removing or 
converting wiki markup in some way. But what it does now works just 
great for me. I wasn't looking to replace the (awful) built-in 
MediaWiki search (I use SphinxSearch for that, which is vastly 
better). I just wanted a single search point for finding those nuggets 
I know I have hidden away somewhere.

As an aside, why are the scriptindex field definitions kept in a 
separate file? Couldn't they go in-line before the data (much as 
column names go in the header line of a CSV file)? When parsing the 
data stream, lines like caption:xxx would be read as field 
definitions, and lines like caption=xxx would be text for indexing - 
not difficult, surely? A converter such as mediawiki2omega could then 
generate a single stream that is piped straight into scriptindex, 
without needing a separate script file for the field definitions.
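
To show what I mean, here is a rough sketch in Python of how such a 
combined stream could be split back into its two parts. This is not 
something scriptindex does today; split_stream is a made-up name, and 
the rule it applies (leading "name : actions" lines are definitions, 
and everything from the first other non-blank line onwards is data) is 
just my assumption of how the idea might work.

```python
import re

# A definition line looks like "name : actions"; a data line looks like
# "name=value". Only a field name directly followed by ":" counts as a
# definition, so colons inside data values are harmless.
_DEFINITION = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*:")


def split_stream(lines):
    """Split a combined stream into (definition_lines, data_lines).

    Hypothetical helper: definitions are only recognised in a header
    block at the start of the stream, like a CSV header line.
    """
    script, data = [], []
    in_header = True
    for line in lines:
        line = line.rstrip("\n")
        if in_header:
            if _DEFINITION.match(line):
                script.append(line)
                continue
            if not line.strip():
                continue  # ignore blank lines inside the header
            in_header = False  # first data line ends the header
        data.append(line)
    return script, data


if __name__ == "__main__":
    stream = [
        "caption : index field\n",
        "\n",
        "caption=Hello: world\n",
        "=continued on a second line\n",
        "\n",
        "caption=Second record\n",
    ]
    script, data = split_stream(stream)
    print(script)
    print(data)
```

The two lists could then be handled exactly as the separate script file 
and data file are now, so nothing about the indexing itself would need 
to change.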



