<p dir="ltr">I should have thought about this complexity while writing my proposal :) But thanks for the advice , Ill look into it . I think we need them for the stemming . The inex data set will be much more helpful for all the other tests such as querying and searching .</p>
<p dir="ltr">Regards <br>
Aarsh </p>
<div class="gmail_quote">On 14/05/2014 2:32 pm, "James Aylett" <<a href="mailto:james-xapian@tartarus.org">james-xapian@tartarus.org</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 14 May 2014, at 05:38, Aarsh Shah <<a href="mailto:aarshkshah1992@gmail.com">aarshkshah1992@gmail.com</a>> wrote:<br>
<br>
> Questions<br>
> -> If anyone has experience with downloading Wikipedia dumps, please can I get some advice on how to go about it and which is the best place to get them?<br>
<br>
<a href="https://en.wikipedia.org/wiki/Wikipedia:Database_download" target="_blank">https://en.wikipedia.org/wiki/Wikipedia:Database_download</a><br>
<br>
Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with BitTorrent for Wikipedia dumps.<br>
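<br>
For instance, here's a rough, untested sketch of streaming pages out of a pages-articles dump with just the Python standard library; the filename and the exact export namespace are assumptions, as both depend on which dump you grab:<br>
<pre>
# Untested sketch: stream (title, wikitext) pairs out of a
# pages-articles dump without loading the whole file into memory.
import bz2
import xml.etree.ElementTree as ET

# Filename is an assumption; use whichever dump you downloaded.
DUMP = "enwiki-latest-pages-articles.xml.bz2"

def iter_pages(path):
    """Yield (title, wikitext) for each page in a MediaWiki XML dump."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            # Dump elements carry an export-version namespace, so match
            # on the local part of the tag and reuse its namespace prefix.
            if elem.tag.endswith("}page"):
                ns = elem.tag[:-len("page")]  # the "{namespace-uri}" part
                title = elem.findtext(ns + "title")
                text = elem.findtext(ns + "revision/" + ns + "text") or ""
                yield title, text
                elem.clear()  # free the finished page element

for title, text in iter_pages(DUMP):
    print(title, len(text))
    break  # just show the first article
</pre>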
<br>
Note that you'll probably then need to run each article through a MediaWiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at <a href="http://www.mediawiki.org/wiki/Alternative_parsers" target="_blank">http://www.mediawiki.org/wiki/Alternative_parsers</a> which includes some Python ones, although I haven't used them. (I've used a couple of Ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)<br>
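<br>
For instance, mwparserfromhell is one of the Python libraries on that list; a minimal sketch (not something I've battle-tested) of using its strip_code() to get plain text suitable for feeding to a tokeniser:<br>
<pre>
# Rough sketch: turn an article's wikitext into plain text before
# tokenising it (e.g. before handing it to Xapian's TermGenerator).
import mwparserfromhell  # pip install mwparserfromhell

def to_plain_text(wikitext):
    """Strip templates, link markup and formatting from wikitext."""
    wikicode = mwparserfromhell.parse(wikitext)
    # strip_code() keeps the visible text and drops most markup;
    # heavily templated articles may still leave some noise behind.
    return wikicode.strip_code(normalize=True, collapse=True)

sample = "'''Xapian''' is a [[free software]] search engine {{citation needed}} library."
print(to_plain_text(sample))
# roughly: "Xapian is a free software search engine library."
</pre>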
<br>
J<br>
<br>
--<br>
James Aylett, occasional trouble-maker<br>
<a href="http://xapian.org" target="_blank">xapian.org</a><br>
<br>
</blockquote></div>