[Xapian-discuss] Is this a correct method of indexing?

Henry henka at cityweb.co.za
Mon Jan 19 08:30:46 GMT 2009


 On Mon 19/01/09  8:26 AM , "Tony Lambiris" tonylambiris at gmail.com sent:
> I don't know if it's
> over-kill to index the entire document or not, or if there are any
> preferred methods. I had toyed with the idea of indexing only the
> first paragraph of the document, but I wanted to keep the input method
> totally unobtrusive when it came to the format of the text. All I care
> about is the title (or file name) and the contents, but I don't know
> if this is the best approach.... the database grows quite large and
> indexing slows down dramatically.

I suppose it depends on the intended purpose of your search app.  For typical
search engine apps, it's common to index only up to a specific size of each
document (e.g. the first 100KB).
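
A size cap like that can be sketched in plain Perl. This is only a rough
illustration: cap_text() is a made-up helper (not part of Xapian), and it
counts characters rather than bytes. It backs up to the last whitespace so
the final word isn't cut in half.

```perl
use strict;
use warnings;

# Hypothetical helper: cap $text at $max characters, backing up to the
# last whitespace so the final word is not split.
sub cap_text {
    my ($text, $max) = @_;
    return $text if length($text) <= $max;

    my $head = substr($text, 0, $max);

    # Only back up if the cut landed inside a word and there is an
    # earlier word boundary to fall back to.
    if (substr($text, $max, 1) =~ /\S/ && $head =~ /\s/) {
        $head =~ s/\S+\z//;
    }
    $head =~ s/\s+\z//;    # drop trailing whitespace
    return $head;
}

my $capped = cap_text('alpha beta gamma', 12);
print "$capped\n";
```

A single word longer than the cap is simply cut at the cap, since there is
no earlier boundary to fall back to.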

I suggest you keep it simple.  Xapian reminds me of UNIX: there are so many
ways of doing things that it can be daunting initially.

Use a simple TermGenerator, e.g. in Perl:

use Search::Xapian qw(:all);   # exports DB_CREATE_OR_OVERWRITE

my $index =
      Search::Xapian::WritableDatabase->new( $index_path,
        DB_CREATE_OR_OVERWRITE );

my $doc_text = substr ($text, 0, $max);
my $tit_weight = 100;

my $tg= Search::Xapian::TermGenerator->new;
# ...set_stemmer, set_stopper, set_document...

# index the body text, letting index_text() take care of
# all the term details.
$tg->index_text ( $doc_text );

# simulate a 'field' with a prefix, boosting its weight.
# this way you can search for [title:bob] if you want to.
# either way, if there's a hit in this title, it'll score a bit higher.
$tg->index_text ( $doc_title, $tit_weight, 'XTITLE');

...
$index->add_document($doc);
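
For the search side, the [title:bob] syntax only works once the query parser
knows that 'title' maps to the XTITLE prefix.  Here is a minimal sketch,
assuming the Search::Xapian bindings are installed.  The in-memory database,
the sample text and the 'doc-1' payload are made up for illustration, and
the stemmer and stopper are left out to keep it short.

```perl
use strict;
use warnings;
use Search::Xapian;

# In-memory database (constructor called with no arguments), so nothing
# is written to disk.  A real app would open its on-disk index instead.
my $db = Search::Xapian::WritableDatabase->new();

my $doc = Search::Xapian::Document->new;
$doc->set_data('doc-1');    # hypothetical payload returned with each hit

my $tg = Search::Xapian::TermGenerator->new;
$tg->set_document($doc);

# Body text unprefixed; title under the XTITLE prefix with a wdf of 100.
$tg->index_text('some body text about builders');
$tg->index_text('Bob the builder', 100, 'XTITLE');
$db->add_document($doc);

# Tell the query parser that the user-visible field 'title' means the
# term prefix XTITLE.
my $qp = Search::Xapian::QueryParser->new;
$qp->add_prefix('title', 'XTITLE');

my $query   = $qp->parse_query('title:bob');
my $enquire = $db->enquire($query);
my @matches = $enquire->matches(0, 10);

for my $match (@matches) {
    printf "%s (docid %d)\n",
        $match->get_document->get_data, $match->get_docid;
}
```

Note that a plain [bob] query only looks at unprefixed terms, so it reaches
the title only if the title text is also indexed without the prefix.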


The nice thing about using this simple approach is that it's easier to understand
what the hell is going on initially, and you can always expand on it, getting all
dirty 'n sexy as needed.


Cheers
Henry


