[Xapian-discuss] Search::Xapian questions

Sat Nov 1 17:49:03 GMT 2008

Alex,

I am writing an indexer using your Search::Xapian module. Based on what
I have learned from xapian-omega/omindex.cc, I think the following is
correct. Could you please confirm?

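# $tg (the term generator), $db (the writable database) and
# document_iterator() are set up elsewhere.
my $documents=document_iterator($root);
while( defined($_=$documents->()) ) {
  print "processing $_->{name}\n";

  # one Xapian document per input item; its name is stored as document data
  my $doc=Search::Xapian::Document->new;
  $doc->set_data($_->{name});
  $tg->set_document($doc);
  my $incr=0;    # true if a field was indexed since the last position gap
  if( length $_->{text} ) {
    # body text, default wdf increment
    $tg->index_text(Encode::encode('utf8', $_->{text}));
    $incr=1;
  }
  if( length $_->{header} ) {
    if( $incr ) {
      $incr=0;
      # leave a gap in term positions between fields
      $tg->increase_termpos(100);
    }
    # header, wdf increment 2
    $tg->index_text(Encode::encode('utf8', $_->{header}), 2);
    $incr=1;
  }
  if( length $_->{title} ) {
    if( $incr ) {
      $incr=0;
      $tg->increase_termpos(100);
    }
    # title, wdf increment 10
    $tg->index_text(Encode::encode('utf8', $_->{title}), 10);
  }
  $db->add_document($doc);
}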

Is it correct to pass UTF-8 encoded text to index_text(), as done above?

What about stopwords? Do I need them for indexing? I have read somewhere
that a stopword list can be created from the database/termlist. How do I
do that?

I have read your theoretical background page. It says there that the
probabilistic IR model is the "correct" one. Other IR systems use the
TF/IDF (http://en.wikipedia.org/wiki/Tf-idf) algorithm to compute
ranks. How does TF/IDF relate to probabilistic IR?

Thanks,
Torsten

--
Need professional mod_perl support?
Just hire me: torsten.foertsch at gmx.net