[Xapian-discuss] Perl binding: crash & missing functions?
Sander Pilon
sander@pilon.com
Tue, 4 May 2004 21:58:17 +0200
> -----Original Message-----
> From: Alex Bowley [mailto:alex@ixion.tartarus.org] On Behalf
> Of Alex Bowley
> Sent: Tuesday, May 04, 2004 14:01
> To: Sander Pilon
> Subject: Re: [Xapian-discuss] Perl binding: crash & missing functions?
>
> On Sun, May 02, 2004 at 07:00PM, Sander Pilon wrote:
> > First of all, when I add +/- 6000 documents (small ones, avg. less
> > than 200 words) it crashes.
> > (It just quits with "Aborted".)
> >
> > When I do this in batches of 500, it doesn't. (add 500, quit process,
> > add another 500, etc) Adding a flush() every few hundred documents or
> > even closing and opening the database doesn't help. Help?
>
> Hmmm. Which version of xapian are you using? 0.8.0?
> Search::Xapian is 0.0.5, I assume?
>
Correct.
> Any chance you could mail me some sample code / input data?
> (I'll understand if this is confidential)
Neither the code nor the data is confidential. It's just that the data is,
well, large. (Too much to mail.)
I could give you access to the MySQL database (this *WOULD* be confidential
:), as it's on a fast server. But before I do, let me explain a bit more.
First, the machine I tested on: a Celeron 350 with 256MB RAM, running Linux
2.4.20 (Debian).
I can reproducibly make it crash after X documents: if I reset the database
and repeat the steps that made it crash last time, it crashes again.
Now, my first thought would've been that something in a specific document
makes it crash. It doesn't seem that way, though. If I do a single run of
6000 documents, it crashes at document 5999. If I do 6 runs of 1000
documents, it crashes in run 6, document 999. (Same document.) But if I do
12 runs of 500, it completes just fine.
And now for the weird part.
Just to make sure it wasn't my rather old hardware, I installed a brand new
Debian testing (sarge) installation in a VMware session on my rather new
Athlon 2600+ with 1GB RAM, etc. The VMware session has 384MB RAM. The first
thing I noticed is that runs that make it crash on the Celeron don't make
it crash in VMware. But before you go "ooh, his hardware is flaky!" ......
Other runs *DO* make it crash. o_O'
Could it be Unicode-related? (The documents I'm trying to index may
contain UTF-8 text.)
Are there certain terms Xapian doesn't like? (Still, no excuse for "Aborted"
... )
> ...... (snip)
>
> I'm just about to start hacking on a new version of
> Search::Xapian. I'll make sure these methods get wrapped
> correctly. I'll let you know when it's been uploaded.
>
Thanks.
Below is my rather primitive (don't laugh, it's my first one and I haven't
written Perl in well over two years) indexer that makes it go boom...
http://www.shacknews.com/sander/indexer.txt
It's not much more complicated than a split on whitespace on the articles,
then remove the stopwords, strip punctuation and add terms with increasing
termpos, then add the document to Xapian, repeat.
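
For anyone who doesn't want to open the link, here is a minimal sketch of
the tokenizing steps described above. It is not the indexer at the URL, and
several details are assumptions: the stopword list is made up, terms are
lowercased, punctuation is stripped before the stopword check, and removed
stopwords do not consume a term position. The Xapian calls are left out so
the snippet runs on its own; in the real indexer each (term, termpos) pair
would presumably be handed to the document (for instance via
Search::Xapian's add_posting) before the document is added to the database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stopword list; the real indexer's list is not shown here.
my %stopword = map { $_ => 1 } qw(a an and the of to in is it);

# Turn one article into a list of [term, termpos] pairs.
sub tokenize {
    my ($text) = @_;
    my @postings;
    my $termpos = 0;
    for my $word (split ' ', $text) {       # split on whitespace
        my $term = lc $word;                # assumption: lowercase terms
        $term =~ s/[[:punct:]]//g;          # strip punctuation
        next if $term eq '';                # word was punctuation only
        next if $stopword{$term};           # remove stopwords
        push @postings, [$term, ++$termpos];
    }
    return @postings;
}

# Print the postings for one sample article.
for my $p (tokenize("The quick, brown fox jumps over the lazy dog.")) {
    printf "%-6s %d\n", @$p;
}
```

Running the loop over every article and printing the postings instead of
indexing them is also a cheap way to see exactly which terms are generated
for the document that triggers the abort.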