<p dir="ltr">I should have thought about this complexity while writing my proposal :) But thanks for the advice , Ill look into it . I think we need them for the stemming . The inex data set will be much more helpful for all the other tests such as querying and searching .</p>
<p dir="ltr">Regards <br>
Aarsh </p>
<div class="gmail_quote">On 14/05/2014 2:32 pm, "James Aylett" <<a href="mailto:james-xapian@tartarus.org">james-xapian@tartarus.org</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 14 May 2014, at 05:38, Aarsh Shah <<a href="mailto:aarshkshah1992@gmail.com">aarshkshah1992@gmail.com</a>> wrote:<br>
<br>
> Questions<br>
> -> If anyone has experience with downloading Wikipedia dumps, please can I get some advice on how to go about it and which is the best place to get them?<br>
<br>
<a href="https://en.wikipedia.org/wiki/Wikipedia:Database_download" target="_blank">https://en.wikipedia.org/wiki/Wikipedia:Database_download</a><br>
<br>
Use the XML dumps (and just the pages-articles version); the SQL ones are a nightmare. I've always had good success with BitTorrent for Wikipedia dumps.<br>
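<br>
For instance, here's a rough, untested sketch of streaming pages out of a pages-articles dump with just the Python standard library; the filename and the exact export namespace are assumptions, as both depend on which dump you grab:<br>
<pre>
# Untested sketch: stream (title, wikitext) pairs out of a
# pages-articles dump without loading the whole file into memory.
import bz2
import xml.etree.ElementTree as ET

# Filename is an assumption; use whichever dump you downloaded.
DUMP = "enwiki-latest-pages-articles.xml.bz2"

def iter_pages(path):
    """Yield (title, wikitext) for each page in a MediaWiki XML dump."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            # Dump elements carry an export-version namespace, so match
            # on the local part of the tag and reuse its namespace prefix.
            if elem.tag.endswith("}page"):
                ns = elem.tag[:-len("page")]  # the "{namespace-uri}" part
                title = elem.findtext(ns + "title")
                text = elem.findtext(ns + "revision/" + ns + "text") or ""
                yield title, text
                elem.clear()  # free the finished page element

for title, text in iter_pages(DUMP):
    print(title, len(text))
    break  # just show the first article
</pre>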
<br>
Note that you'll probably then need to run each article through a MediaWiki syntax parser to interpret (or possibly just strip) macros and formatting commands before tokenisation. There's a list of libraries at <a href="http://www.mediawiki.org/wiki/Alternative_parsers" target="_blank">http://www.mediawiki.org/wiki/Alternative_parsers</a> which includes some Python ones, although I haven't used them. (I've used a couple of Ruby ones, and the quality is highly variable, so you may need to play with different ones to get what you need.)<br>
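<br>
For instance, mwparserfromhell is one of the Python libraries on that list; a minimal sketch (not something I've battle-tested) of using its strip_code() to get plain text suitable for feeding to a tokeniser:<br>
<pre>
# Rough sketch: turn an article's wikitext into plain text before
# tokenising it (e.g. before handing it to Xapian's TermGenerator).
import mwparserfromhell  # pip install mwparserfromhell

def to_plain_text(wikitext):
    """Strip templates, link markup and formatting from wikitext."""
    wikicode = mwparserfromhell.parse(wikitext)
    # strip_code() keeps the visible text and drops most markup;
    # heavily templated articles may still leave some noise behind.
    return wikicode.strip_code(normalize=True, collapse=True)

sample = "'''Xapian''' is a [[free software]] search engine {{citation needed}} library."
print(to_plain_text(sample))
# roughly: "Xapian is a free software search engine library."
</pre>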
<br>
J<br>
<br>
--<br>
James Aylett, occasional trouble-maker<br>
<a href="http://xapian.org" target="_blank">xapian.org</a><br>
<br>
</blockquote></div>