[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Tue Nov 24 14:17:35 GMT 2009
Hi Olly,
Thanks for your explanation. Very helpful!
-Ying
Olly Betts wrote:
> On Mon, Nov 23, 2009 at 09:58:02AM -0600, Ying Liu wrote:
>
>> The reason the speed was slow was not Xapian. It's because I was using
>> Set::Scalar, which made the computation so slow. After I changed to a
>> hash, it's all right.
>>
>
> Ah, good to know it's not a Xapian issue.
>
>
>> I am still working on these bigrams and I have another problem with Xapian.
>> I am trying to find the bigrams of about 1 million documents (1.3G), about
>> 206 million total terms, with window size 2. I first scan the documents
>> one by one and print out all the bigrams of the entire collection. The
>> output file has about 239 million bigrams (3.3G), one bigram per line,
>> some of them repeated. Then I use Xapian to index this file to get the
>> frequency of each bigram. Each bigram is saved like 'last<>year', i.e. as
>> a single string (term) in the file.
>>
>> The following is my code:
>>
>> my $w_position = 0;
>> while (my $line = <FILE>)
>> {
>>     chop ($line);
>>     $w_position++;
>>     $doc->add_posting($line, $w_position);
>> }
>> close FILE;
>> $db->add_document($doc) or die "failed to add $file: $!";
>>
>>
>> After 23 minutes, it got an error:
>> terminate called after throwing an instance of 'std::bad_alloc' what():
>>
>
> This means you ran out of memory.
>
> You're attempting to add 239 million term postings to a single document.
> Document objects are built up in memory, and internally that is a C++
> std::map container, with an entry for each unique term. So what you're
> doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> associative array.
>
>
>> Is there another way to index this 3.3G file? It works well on smaller
>> files. I am testing some extreme cases. Thank you very much!
>>
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read. The collection frequency of each term
> at the end will be the total number of times it appeared.
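A rough, untested sketch of that every-N-lines idea, assuming the Search::Xapian
Perl bindings, a FILE handle opened as in the code above, and a made-up database
path and batch size:

    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new(
        'bigrams.db', Search::Xapian::DB_CREATE_OR_OPEN);
    my $doc = Search::Xapian::Document->new();
    my $batch = 100_000;   # bigrams per document; tune to available memory
    my $w_position = 0;

    while (my $line = <FILE>) {
        chomp $line;
        $doc->add_posting($line, ++$w_position);
        if ($w_position >= $batch) {
            # Flush this batch and start a fresh, empty document.
            $db->add_document($doc);
            $doc = Search::Xapian::Document->new();
            $w_position = 0;
        }
    }
    $db->add_document($doc) if $w_position;   # flush the last partial batch
    close FILE;

    # Afterwards the collection frequency is the total count, e.g.:
    # print $db->get_collection_freq('last<>year'), "\n";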
>
> Personally, I'd not use Xapian for that, but just use Perl hashes. Your data
> is probably too large to process in one go, but you can make multiple runs
> over subsets of the bigrams. A simple way would be to partition by the first
> byte, and run once for each possible first byte - something like this (totally
> untested) code:
>
> foreach my $first_byte (0 .. 255) {
>     my %frequency = ();
>     seek FILE, 0, 0 or die $!;
>     while (my $line = <FILE>)
>     {
>         chop ($line);
>         ++$frequency{$line} if ord($line) == $first_byte;
>     }
>     foreach my $bigram (sort keys %frequency) {
>         print "$frequency{$bigram}\t$bigram\n";
>     }
> }
>
> If this is going to get run a lot, you probably want to partition on a
> hashed version of $line to get a more even split so you can make fewer
> passes.
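One possible (untested) shape for that hashed partitioning, using Digest::MD5
purely as an illustrative hash; the pass count of 16 is arbitrary:

    use Digest::MD5 qw(md5);

    my $passes = 16;   # arbitrary; pick so each partition's hash fits in RAM
    foreach my $part (0 .. $passes - 1) {
        my %frequency = ();
        seek FILE, 0, 0 or die $!;
        while (my $line = <FILE>) {
            chomp $line;
            # Bucket on a hash of the whole bigram rather than its first
            # byte, so the partitions come out roughly equal in size.
            next unless unpack("N", md5($line)) % $passes == $part;
            ++$frequency{$line};
        }
        foreach my $bigram (sort keys %frequency) {
            print "$frequency{$bigram}\t$bigram\n";
        }
    }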
>
> Cheers,
> Olly
>