[Xapian-discuss] Search::Xapian questions
Torsten Foertsch
torsten.foertsch at gmx.net
Sat Nov 1 17:49:03 GMT 2008
Alex,

I am writing an indexer using your Search::Xapian module. Based on what I
have learned from xapian-omega/omindex.cc, I think the following is
correct. Can you please confirm it?

my $documents=document_iterator($root);
while( defined($_=$documents->()) ) {
  print "processing $_->{name}\n";
  my $doc=Search::Xapian::Document->new;
  $doc->set_data($_->{name});
  $tg->set_document($doc);
  my $incr=0;
  if( length $_->{text} ) {
    $tg->index_text(Encode::encode('utf8', $_->{text}));
    $incr=1;
  }
  if( length $_->{header} ) {
    if( $incr ) {
      $incr=0;
      $tg->increase_termpos(100);
    }
    $tg->index_text(Encode::encode('utf8', $_->{header}), 2);
    $incr=1;
  }
  if( length $_->{title} ) {
    if( $incr ) {
      $incr=0;
      $tg->increase_termpos(100);
    }
    $tg->index_text(Encode::encode('utf8', $_->{title}), 10);
  }
  $db->add_document($doc);
}

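For context, document_iterator() is my own helper and is not part of Search::Xapian. The loop only relies on it returning a closure that yields one hash ref per document (with name, text, header and title keys) and undef when there is nothing left. A minimal stand-in is sketched below. Note that it takes an array ref of ready-made records, which is an assumption made purely for illustration; the helper in the loop is called with $root instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for document_iterator(): the records are passed in
# directly as an array ref (an assumption for this sketch) instead of being
# collected from $root.
sub document_iterator {
    my ($records) = @_;
    my $i = 0;
    # Each call returns the next hash ref, or undef once the list is exhausted.
    return sub {
        return $i < @$records ? $records->[$i++] : undef;
    };
}

my $documents = document_iterator([
    { name => 'a.txt', title => 'First',  header => '',          text => 'some body text' },
    { name => 'b.txt', title => 'Second', header => 'A heading', text => '' },
]);

# Same loop shape as in the indexer, minus the Xapian calls.
while ( defined( my $rec = $documents->() ) ) {
    print "processing $rec->{name}\n";
}
```

Empty strings stand for missing fields here, which matches the length checks in the indexing loop.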
Is it correct to pass UTF-8 encoded text to index_text() as done in the
indexing loop above?

What about stop words? Do I need them for indexing? I have read somewhere
that a stop word list can be generated from the database's term list. How
is that done?

I have read your theoretical background page. It says there that the
probabilistic IR model is the "correct" one. Other IR systems use the
TF/IDF (http://en.wikipedia.org/wiki/Tf-idf) algorithm to compute
ranks. How does that relate to the probabilistic IR model?
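
For reference, the TF/IDF weighting I mean is usually written roughly as follows. The symbols are my own shorthand and are not taken from the Xapian documentation: tf_{t,d} is the number of occurrences of term t in document d, N is the number of documents in the collection, and n_t is the number of documents that contain t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{n_t}
```

A document's score for a query is then the sum of w_{t,d} over the query terms. Many variants dampen or normalize one or both factors.
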
Thanks,
Torsten
--
Need professional mod_perl support?
Just hire me: torsten.foertsch at gmx.net