[Xapian-discuss] Perl binding: crash & missing functions?

Sander Pilon sander@pilon.com
Tue, 4 May 2004 21:58:17 +0200


 

> -----Original Message-----
> From: Alex Bowley [mailto:alex@ixion.tartarus.org] On Behalf 
> Of Alex Bowley
> Sent: Tuesday, May 04, 2004 14:01
> To: Sander Pilon
> Subject: Re: [Xapian-discuss] Perl binding: crash & missing functions?
> 
> On Sun, May 02, 2004 at 07:00PM, Sander Pilon wrote:
> > First of all, when I add +/- 6000 documents (small ones, avg. less
> > than 200 words) it crashes.
> > (It just quits with "Aborted".)
> > 
> > When I do this in batches of 500, it doesn't. (add 500, quit process,
> > add another 500, etc) Adding a flush() every few hundred documents or
> > even closing and opening the database doesn't help. Help?
> 
> Hmmm. Which version of xapian are you using? 0.8.0? 
> Search::Xapian is 0.0.5, I assume?
> 

Correct. 

> Any chance you could mail me some sample code / input data? 
> (I'll understand if this is confidential)

Neither the code nor the data is confidential. It's just that the data is,
well, large. (Too much to mail.)

I could give you access to the MySQL database (this *WOULD* be confidential
:), as it's on a fast server. But before I do, let me explain a bit more.

First, the machine I used to test on: a Celeron 350 with 256 MB RAM, running
Linux 2.4.20 (Debian).

I can reproducibly make it crash after X documents. That is, I can reset
the database, and if I repeat the steps that made it crash last time, it will
crash again.

Now, my first thought would've been that something in a specific document
makes it crash. It doesn't seem that way, though: if I do a single run of
6000 documents, it crashes at document 5999. If I do 6 runs of 1000
documents, it crashes in run 6, document 999. (Same document.) But if I do
12 runs of 500, it completes just fine.

And now for the weird part. 

Just to make sure it wasn't my rather old hardware, I installed a brand new
Debian testing (sarge) installation in a VMware session on my rather new
Athlon 2600+ with 1 GB RAM, etc. The VMware session has 384 MB RAM. The first
thing I noticed is that runs that make it crash on the Celeron don't make
it crash in VMware. But before you go "ooh, his hardware is flaky!" ......
Other runs *DO* make it crash. o_O'

Could it be Unicode-related? (The documents I'm trying to index may
contain Unicode text, encoded as UTF-8.)
Are there certain terms Xapian doesn't like? (Still, no excuse for "Aborted"
... )

> ...... (snip)
> 
> I'm just about to start hacking on a new version of 
> Search::Xapian. I'll make sure these methods get wrapped 
> correctly. I'll let you know when it's been uploaded.
> 

Thanks.

Below is my rather primitive (don't laugh, it's my first one and I haven't
written Perl in well over two years) indexer that makes it go boom...

http://www.shacknews.com/sander/indexer.txt

It's not much more complicated than a split on whitespace on the articles,
then remove the stopwords, strip punctuation and add terms with increasing
termpos, then add the document to Xapian, repeat.
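
For readers who can't reach the link, the term-extraction steps described above can be sketched roughly as follows. This is an illustration in Python, not the Perl indexer itself, and it leaves out the Xapian calls entirely. The stopword list, the lowercasing, stripping punctuation before the stopword check, and counting positions only for kept terms are all assumptions; the function name extract_postings is made up for this sketch.

```python
import string

# Hypothetical stopword list; the real indexer's list is not shown in the post.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}


def extract_postings(text, stopwords=STOPWORDS):
    """Return a list of (term, termpos) pairs for one article.

    Steps: split on whitespace, strip surrounding punctuation,
    drop stopwords, and number the remaining terms from 1 upward.
    """
    postings = []
    termpos = 0
    for word in text.split():  # split on any run of whitespace
        # Strip leading/trailing ASCII punctuation; lowercasing is an assumption.
        term = word.strip(string.punctuation).lower()
        if not term or term in stopwords:
            continue  # skip empty leftovers and stopwords
        termpos += 1  # assumption: only kept terms advance the position
        postings.append((term, termpos))
    return postings


if __name__ == "__main__":
    for term, pos in extract_postings("The quick, brown fox!"):
        print(pos, term)
```

In the real indexer each (term, termpos) pair would then be attached to a Xapian document (Document::add_posting in the C++ API) before the document is added to the database.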