[Xapian-devel] Starting work on Perf Test Module
Aarsh Shah
aarshkshah1992 at gmail.com
Wed May 14 10:06:47 BST 2014
I should have thought about this complexity while writing my proposal :)
But thanks for the advice, I'll look into it. I think we need the Wikipedia
dumps for the stemming tests. The INEX data set will be much more helpful
for all the other tests, such as querying and searching.
Regards
Aarsh
On 14/05/2014 2:32 pm, "James Aylett" <james-xapian at tartarus.org> wrote:
> On 14 May 2014, at 05:38, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:
>
> > Questions
> > -> If anyone has experience with downloading Wikipedia dumps, please
> > can I get some advice on how to go about doing it and which is the best
> > place to get them?
>
> https://en.wikipedia.org/wiki/Wikipedia:Database_download
>
> Use the XML dumps (and just the pages-articles version); the SQL ones are
> a nightmare. I've always had good success with BitTorrent for Wikipedia
> dumps.
>
> Note that you'll probably then need to run each article through a
> MediaWiki syntax parser to interpret (or possibly just strip) macros and
> formatting commands before tokenisation. There's a list of libraries at
> http://www.mediawiki.org/wiki/Alternative_parsers which includes some
> Python ones, although I haven't used them. (I've used a couple of Ruby
> ones, and the quality is highly variable, so you may need to play with
> different ones to get what you need.)
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
>
>
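For reference, the two steps described above (stream articles out of a
pages-articles XML dump, then strip markup before tokenisation) could be
sketched roughly as follows. This is only a minimal illustration, not a
tested pipeline: the function names are made up for this example, the dump
path passed to iter_dump is whatever file you downloaded, and the regex-based
strip_markup is a crude stand-in for a real MediaWiki parser (it ignores
tables, refs, HTML tags and much else).

```python
import bz2
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) pairs from a pages-articles XML stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's children


def iter_dump(path):
    """Stream articles straight from a .xml.bz2 dump without unpacking it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)


def strip_markup(wikitext):
    """Very rough cleanup (assumption: good enough for a first experiment)."""
    text = wikitext
    # Remove innermost {{...}} templates repeatedly so nesting unwinds.
    while True:
        text, count = re.subn(r"\{\{[^{}]*\}\}", "", text)
        if count == 0:
            break
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop the quote runs used for bold and italic.
    text = re.sub(r"'{2,}", "", text)
    return text
```

Using iterparse keeps only one page in memory at a time, which matters because
the uncompressed dump is far too large to load whole. If the crude stripping
turns out to distort the stemming input, swapping strip_markup for one of the
parsers from the Alternative_parsers list should not affect the rest.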