[Xapian-discuss] bigrams search speed and index documents

Ying Liu liux0395 at umn.edu
Mon Nov 23 15:58:02 GMT 2009


Olly Betts wrote:
> On Thu, Nov 05, 2009 at 05:52:38PM -0600, Ying Liu wrote:
>   
>> I have found the way to speed up the searching speed.
>>     
>
> It would be interesting to hear what you did to improve the search speed...
>
> Cheers,
>     Olly
>   
Hi Olly,

Thanks for asking.

The reason the search was slow was not Xapian. It was because I was using
Set::Scalar, which made the computation very slow. After I switched to a
plain hash, everything was fine.

I am still working on these bigrams and I have run into another problem with
Xapian. I am trying to find the bigrams of about 1 million documents (1.3 GB),
about 206 million terms in total, with a window size of 2. I first scan the
documents one by one and print out all the bigrams of the whole collection.
The resulting file has about 239 million bigrams (3.3 GB), one bigram per
line, and some of them are repeated. Then I use Xapian to index this file to
get the frequency of each bigram. Each bigram, e.g. 'last<>year', is saved as
a single string (term) in the file.
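Roughly, the extraction step looks like the sketch below (simplified, and
assuming plain whitespace tokenization, which is not exactly what my real
code does):

# scan each document and print adjacent word pairs (window size 2), one per line
open(OUT, '>', 'bigrams.txt') or die "cannot write bigrams.txt: $!";
foreach my $file (@documents) {          # @documents: the ~1 million input files
    open(IN, '<', $file) or die "cannot read $file: $!";
    my $prev;
    while (my $line = <IN>) {
        foreach my $word (split ' ', $line) {
            print OUT "$prev<>$word\n" if defined $prev;   # e.g. last<>year
            $prev = $word;
        }
    }
    close IN;
}
close OUT;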

The following is my code:

my $w_position = 0;
while (my $line = <FILE>)
{
        chop ($line);                            # strip the trailing newline
        $w_position++;
        $doc->add_posting($line, $w_position);   # one posting per bigram string
}
close FILE;
$db->add_document($doc) or die "failed to add $file: $!";


After 23 minutes, it got an error:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
-bash: line 7:  6660 Aborted


Is there another way to index this 3.3 GB file? It works well on smaller
files; I am testing some extreme cases. Thank you very much!
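For example, would it be reasonable to split the bigrams across many smaller
documents and flush the database periodically, instead of building one huge
document in memory? A rough sketch of what I mean ($db and FILE are the same
as above; the batch size of 1,000,000 is an arbitrary guess):

my $batch_size = 1_000_000;
my $count      = 0;
my $doc = Search::Xapian::Document->new();
while (my $line = <FILE>)
{
        chomp ($line);
        $count++;
        $doc->add_posting($line, $count);
        if ($count % $batch_size == 0) {
                $db->add_document($doc);                 # store the finished batch
                $db->flush();                            # write it out and free memory
                $doc = Search::Xapian::Document->new();  # start a fresh document
        }
}
close FILE;
$db->add_document($doc) if $count % $batch_size;         # final partial batch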

-Ying