[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Thu Nov 5 15:54:23 GMT 2009
Olly Betts wrote:
> On Tue, Nov 03, 2009 at 07:38:08PM -0600, Ying Liu wrote:
>
>> I am using Xapian to index two XML files. In each file, there are about
>> 6000+ pieces of news. Each piece of news is separated by <DOC> </DOC>.
>> The way I build the index is:
>>
>> 1) read the XML file line by line, get one piece of news's head, date,
>> and contents which are separated by tags
>> 2) remove numbers, change to lower case, remove stop words, and save
>> the result in $buf
>> 3) create a new Xapian::Document $doc, then use the TermGenerator to
>> call set_document($doc) and index_text($buf)
>> 4) add the $doc to the database $db
>>
>
> Please post actual code rather than trying to describe it in English.
This is how I build the index; the code is below. (I also found an
inefficiency in my code for searching.) For 17 MB files, it took
about 11 seconds to scan every line and build the index.
Thanks,
Ying
while (my $line = <FILE>)
{
    if ($line =~ m/^\<DOC/ )
    {
        # start of a news item; nothing to index yet
        #my $id = substr($line, 17, 13);
        next;
    }
    elsif ($line =~ m/^\<\/DOC/ )
    {
        # end of a news item: index the words collected so far
        my $num_doc = $db->get_doccount;
        if ( $num_doc < 10000)
        {
            # remove stop words
            my $buf_news = join(' ', grep { !$stopwords->{$_} } @words);
            my $doc = Search::Xapian::Document->new
                or die "can't create doc object for $file: $!\n";
            my $analyzer = Search::Xapian::TermGenerator->new;
            $analyzer->set_document($doc);
            $analyzer->index_text($buf_news);
            $db->add_document($doc) or die "failed to add $file: $!";
            @words = ();
        }
        else
        {
            last;
        }
    }
    elsif ($line =~ m/HEADLINE\>|DATELINE\>|TEXT\>|P\>$/ )
    {
        # skip lines that contain only structural tags
        next;
    }
    else
    {
        chomp($line);                  # strip the trailing newline only
        $line =~ s/\d//g;              # remove numbers
        # change to lower case for removing stop words
        my $lower_line = lc ($line);
        push(@words, split(' ', $lower_line));
    }
} # end of while loop for one file