[Xapian-devel] Starting work on Perf Test Module

James Aylett james-xapian at tartarus.org
Wed May 14 10:02:36 BST 2014


On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:

> Questions
> -> If anyone has experience with downloading Wikipedia dumps, please can I get some advice on how to go about doing it and which is the best place to get them?

https://en.wikipedia.org/wiki/Wikipedia:Database_download

Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with BitTorrent for Wikipedia dumps.
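
As a rough illustration of reading such a dump without loading it all into memory, here is a minimal Python sketch using only the standard library. The function names (local_name, iter_articles, iter_dump) are made up for this example, and it assumes the usual export layout of <page> elements containing a <title> and a revision <text>; check that against the dump you actually download.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that the export schema puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) pairs from an open pages-articles XML stream.

    Assumes each <page> holds one <title> and one revision <text>, which is
    an assumption about the dump layout, not something checked here.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree


def iter_dump(path):
    """Stream a bz2-compressed dump; 'path' is wherever you saved the file."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)
```

Something like `for title, text in iter_dump("dump.xml.bz2"): ...` (with a placeholder filename) would then feed articles one at a time to whatever does the indexing.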

Note that you'll probably then need to run each article through a MediaWiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at http://www.mediawiki.org/wiki/Alternative_parsers which includes some Python ones, although I haven't used them. (I've used a couple of Ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)
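
To give a feel for the "just strip" option, here is a deliberately crude Python sketch. It is not one of the parsers from that list and not a real MediaWiki parser: it only drops {{...}} templates, unwraps [[...]] links, and removes bold/italic quote runs. The names strip_templates and strip_markup are invented for this example, and tables, references, HTML tags and many other constructs are ignored entirely.

```python
import re


def strip_templates(text):
    """Remove {{...}} templates, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])  # keep only text outside any template
            i += 1
    return "".join(out)


# [[target|label]] -> label, [[target]] -> target
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# '' (italic) and ''' (bold) markers
QUOTES = re.compile(r"'{2,}")


def strip_markup(text):
    """Very rough wikitext-to-plain-text pass, suitable only as a starting point."""
    text = strip_templates(text)
    text = LINK.sub(r"\1", text)
    text = QUOTES.sub("", text)
    return text
```

For a performance test corpus this level of cleanup may be enough; if the exact text matters, one of the proper parsers is the safer route.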

J

-- 
 James Aylett, occasional trouble-maker
 xapian.org



