[Xapian-discuss] bigrams search speed and index documents

Thu Nov 5 15:54:23 GMT 2009

Olly Betts wrote:
> On Tue, Nov 03, 2009 at 07:38:08PM -0600, Ying Liu wrote:
>   
>> I am using Xapian to index two XML files. In each file, there are about  
>> 6000+ pieces of news. Each piece of news is separated by <DOC> </DOC>.  
>> The way I build the index is:
>>
>> 1) read the XML file line by line, get one piece of news's head, date,  
>> and contents which are separated by tags
>> 2) remove numbers, change to lower case, remove stop words, and save
>> the result in $buf
>> 3) create a new Xapian::Document $doc, and use the TermGenerator to
>> set_document($doc) and index_text($buf).
>> 4) add the $doc to the database $db
>>     
>
> Please post actual code rather than trying to describe it in English.

This is how I build the index. (I also found an inefficiency in the
code I use for searching.) For 17 MB files, it took about 11 seconds
to scan every line and build the index.

Thanks,
Ying

while (my $line = <FILE>)
         {
             if ($line =~ m/^\<DOC/ )
             {                
                 #my $id = substr($line, 17, 13);
                 next;
             }
             elsif ($line =~ m/^\<\/DOC/ )
             {                
                 my $num_doc = $db->get_doccount;            
                 if ( $num_doc < 10000)
                 {
                     # remove stop words
                     my $buf_news = join(' ', grep { !$stopwords->{$_} } @words);
                     my $doc = Search::Xapian::Document->new
                         or die "can't create doc object for $file: $!\n";
                     my $analyzer = Search::Xapian::TermGenerator->new;
                     $analyzer->set_document($doc);
                     $analyzer->index_text($buf_news);
                     $db->add_document($doc)
                         or die "failed to add $file: $!";
                     @words = ();               
                 }
                 else
                 {
                     last;
                 }
             }
             elsif ($line =~ m/HEADLINE\>|DATELINE\>|TEXT\>|P\>$/ )
             {
                 next;        
             }
             else
             {
                 chomp($line); # strip the trailing newline only
                 $line =~ s/\d//g; # remove numbers
                 # change to lower case before removing stop words
                 my $lower_line = lc($line);
                 push(@words, split(' ', $lower_line));                  

             }
         } # end of while loop for one file
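
The text clean-up in the loop above (strip digits, lower-case, drop stop
words) can be tried on its own, without Xapian. This is only a minimal
sketch: normalize_line and the %stop list are hypothetical names made up
for illustration, not part of the script above, and it uses chomp, which
removes only a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): clean one line of
# news text and return the words that are not stop words.
sub normalize_line {
    my ($line, $stopwords) = @_;
    chomp $line;              # strip the trailing newline, if any
    $line =~ s/\d//g;         # remove numbers
    return grep { !$stopwords->{$_} } split ' ', lc $line;
}

# Assumed stop word list, for illustration only.
my %stop = map { $_ => 1 } qw(the a of in);

my @words = normalize_line("The 2 Cats of Rome in 2009\n", \%stop);
print join(' ', @words), "\n";   # prints "cats rome"
```

The returned words could then be pushed onto @words and joined into the
buffer passed to index_text, as in the loop above.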