[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

Thu Jan 17 20:25:57 GMT 2008

The explanation sounds plausible. 

As for the indexer: no, it does not use replace_document (I didn't know about that function, actually...)

This is the relevant part of the php-script:

    $doc = new XapianDocument();
    $doc->set_data($data);
    $doc->add_value(1, $postrow['forum_id']);
    $doc->add_value(2, date('Ymd', $postrow['postdate']));
    $doc->add_value(3, $postrow['author_id']);
    // Add the boolean terms
    $doc->add_term("XFORUMID".$postrow["forum_id"]);
    $doc->add_term("XAUTHORID".$postrow["author_id"]);
    // NOTE: this indexes forum_id under the XAUTHORNAME prefix, which
    // looks like a copy-paste slip; the author's name was presumably meant.
    $doc->add_term("XAUTHORNAME".$postrow["forum_id"]);
    // Assign the document to the TermGenerator, which will generate
    // the terms used for searching
    $indexer->set_document($doc);
    $indexer->index_text($postrow['post']);
    $indexer->index_text($postrow['title'], 2);

    // Add the document to the database.
    $database->add_document($doc);

    $postrow = null;
    $data = null;
    $doc = null;

So...I should probably use replace_document if I "update" existing documents? 
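
For reference, a minimal sketch of what the replace_document variant might look like. The 'post_id' field and both helper function names are assumptions made for illustration (they are not in the script above); the "Q" prefix follows the index script quoted below. $database is assumed to be a XapianWritableDatabase and $indexer a XapianTermGenerator.

```php
<?php
// Sketch only: 'post_id' and the helper names are hypothetical.

// Build the unique ID term for a post, using the "Q" prefix.
function unique_term_for_post($postId)
{
    return 'Q' . $postId;
}

// Index one post so that re-running the indexer updates the existing
// document instead of adding a duplicate.
function index_post($database, $indexer, array $postrow, $data)
{
    $doc = new XapianDocument();
    $doc->set_data($data);

    $indexer->set_document($doc);
    $indexer->index_text($postrow['post']);
    $indexer->index_text($postrow['title'], 2);

    // The unique term has to be on the document itself, otherwise a
    // later replace_document() call for this post will not find it.
    $idterm = unique_term_for_post($postrow['post_id']);
    $doc->add_term($idterm);

    // Replaces any document indexed by $idterm, or adds a new
    // document if there is none yet.
    $database->replace_document($idterm, $doc);
}
```

The lookup by unique term is the extra work on every document, so a comparison against scriptindex with "unique" is only fair if the PHP side does something like this as well.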

----- Original Message ----
From: Olly Betts <olly at survex.com>
To: athlon athlonf <athlonkmf at yahoo.com>
Cc: xapian-discuss at lists.xapian.org
Sent: Thursday, January 17, 2008 3:14:15 AM
Subject: Re: [Xapian-discuss] PHP indexing,  what's the PHP method for indexscript

On Wed, Jan 16, 2008 at 09:58:11AM -0800, athlon athlonf wrote:
> >Load 5 suggests something's wrong, because dbi2omega and scriptindex
> >are both linear processes. Are you running several instances in
> >parallel in some way?
> 
> it usually starts off fairly low, but then after half an hour or so,
> it will reach load 5 constantly.

As James says, the scriptindex process itself shouldn't raise the load
by more than 1 (since it's essentially a single process, plus one
/bin/cat child process, which will always be blocked on read except very
briefly when the database is opened or closed).

I suspect what is happening here is that the scriptindex process is
causing the machine to swap so that webserver requests take a lot longer
and so start to overlap.  Hence 4 of the load is actually due to the
webserver (although caused by scriptindex).  I can't think of another
plausible explanation anyway.

> tid : boolean=Q field=id
> pid : unique=Q boolean=Q field=pid

It doesn't seem to make a lot of sense to have two fields mapping to "Q"
like this...

FWIW, I think this may explain why your PHP script is so much faster -
"unique" is quite a slow operation (even if no duplicate documents
exist, just checking for them significantly slows indexing).  Does your
PHP indexer contain code like this:

    $db->replace_document($qterm, $doc);

If not, does it handle enforcing unique documents another way?  If it
doesn't, then you aren't comparing like with like.

If this isn't the explanation, it would be interesting to work out why
there's such a difference.

Cheers,
    Olly
