[Xapian-discuss] bigrams search speed and index documents

Olly Betts olly at survex.com
Tue Nov 24 05:20:37 GMT 2009


On Mon, Nov 23, 2009 at 09:58:02AM -0600, Ying Liu wrote:
> The reason the speed was slow was not Xapian - it's because I was using
> Set::Scalar, which made the computation very slow. After I changed to a
> hash, it's all right.

Ah, good to know it's not a Xapian issue.

> I am still working on these bigrams and I have another problem with
> Xapian.  I am trying to find the bigrams of about 1 million documents
> (1.3G), about 206 million total terms, with window size 2.  I first scan
> the documents one by one and print out all the bigrams of the entire
> collection.  The output file has about 239 million bigrams (3.3G), one
> bigram per line, and some of them are repeated.  Then I use Xapian to
> index this file to get the frequency of each bigram.  Each bigram is
> saved like 'last<>year', as one string (term) per line in the file.
>
> The following is my code:
>
> my $w_position = 0;
> while (my $line = <FILE>)
> {
>         chop ($line);
>         $w_position++;
>         $doc->add_posting($line, $w_position);
> }
> close FILE;
> $db->add_document($doc) or die "failed to add $file: $!";
>
>
> After 23 minutes, it got an error:
> terminate called after throwing an instance of 'std::bad_alloc' what():   

This means you ran out of memory.

You're attempting to add 239 million term postings to a single document.
Document objects are built up in memory, and internally each one uses a C++
std::map container with an entry for each unique term.  So what you're
doing here is using (or perhaps abusing) Xapian::Document as a memory-based
associative array.
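
In Perl terms, that loop is roughly equivalent, memory-wise, to building one
giant hash of positions - a sketch for illustration only, not anything the
bindings actually expose:

    my %postings;                      # one entry per unique bigram, all in RAM
    my $w_position = 0;
    while (my $line = <FILE>) {
        chop ($line);
        # each add_posting() call adds an entry (or appends a position) here
        push @{ $postings{$line} }, ++$w_position;
    }

With 239 million postings in one document, that structure is what runs you
out of memory.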

> Is there another way to index this 3.3G file? It works well on smaller
> files; I am testing some extreme cases. Thank you very much!

If you are just doing this as a way to count frequencies, you could simply
start a new document every N lines read.  The collection frequency of each
term at the end will be the total number of times it appeared.
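
For example - totally untested, assuming the Search::Xapian Perl bindings
and a flush threshold picked arbitrarily - something along these lines keeps
each individual document small:

    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new('bigrams.db',
                 Search::Xapian::DB_CREATE_OR_OPEN);
    my $doc = Search::Xapian::Document->new();
    my $count = 0;
    while (my $line = <FILE>) {
        chop ($line);
        # add_term() is enough for counting - the wdf records repeats, and
        # positional data isn't needed just to get frequencies
        $doc->add_term($line);
        if (++$count % 100_000 == 0) {
            $db->add_document($doc);            # flush this batch
            $doc = Search::Xapian::Document->new();
        }
    }
    $db->add_document($doc) if $count % 100_000;
    close FILE;

    # afterwards the collection frequency of a bigram is its total count, e.g.
    # print $db->get_collection_freq('last<>year'), "\n";

Memory use is then bounded by the number of unique bigrams in one batch
rather than in the whole file.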

Personally, I'd not use Xapian for that, but just use Perl hashes.  Your data
is probably too large to process in one go, but you can make multiple runs
over subsets of the bigrams.  A simple way would be to partition by the first
byte, and run once for each possible first byte - something like this (totally
untested) code:

    foreach my $first_byte (0 .. 255) {
        my %frequency = ();
        # rewind and make one pass per possible first byte
        seek FILE, 0, 0 or die $!;
        while (my $line = <FILE>) {
            chop ($line);
            ++$frequency{$line} if ord($line) == $first_byte;
        }
        foreach my $bigram (sort keys %frequency) {
            print "$frequency{$bigram}\t$bigram\n";
        }
    }

If this is going to get run a lot, you probably want to partition on a
hashed version of $line to get a more even split so you can make fewer
passes.
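
As a rough illustration (again totally untested), using Perl's unpack
checksum as the hash and splitting into, say, 16 buckets:

    my $passes = 16;                   # pick so one bucket's hash fits in RAM
    foreach my $bucket (0 .. $passes - 1) {
        my %frequency = ();
        seek FILE, 0, 0 or die $!;
        while (my $line = <FILE>) {
            chop ($line);
            # a checksum of the line's bytes spreads bigrams fairly evenly,
            # unlike the first byte, which is skewed towards common letters
            ++$frequency{$line} if unpack("%32C*", $line) % $passes == $bucket;
        }
        foreach my $bigram (sort keys %frequency) {
            print "$frequency{$bigram}\t$bigram\n";
        }
    }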

Cheers,
    Olly


