[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Tue Nov 24 14:17:35 GMT 2009
Hi Olly,
Thanks for your explanation. Very helpful!
-Ying
Olly Betts wrote:
> On Mon, Nov 23, 2009 at 09:58:02AM -0600, Ying Liu wrote:
>
>> The reason the speed was slow was not Xapian. It's because I was using
>> Set::Scalar, which made the computation so slow. After I changed to a
>> hash, it's all right.
>>
>
> Ah, good to know it's not a Xapian issue.
>
>
>> I am still working on these bigrams and I have another problem with Xapian.
>> I am trying to find the bigrams of about 1 million documents (1.3G), about
>> 206 million total terms, with window size 2. I first scan the documents
>> one by one and print out all the bigrams of the entire collection. The
>> output file has about 239 million bigrams (3.3G), one bigram per line,
>> some of them repeated. Then I use Xapian to index this file to get the
>> frequency of each bigram. Each bigram is saved like 'last<>year', i.e. as
>> a single string (term) in the file.
>>
>> The following is my code:
>>
>> my $w_position = 0;
>> while (my $line = <FILE>)
>> {
>>     chop ($line);
>>     $w_position++;
>>     $doc->add_posting($line, $w_position);
>> }
>> close FILE;
>> $db->add_document($doc) or die "failed to add $file: $!";
>>
>>
>> After 23 minutes, it got an error:
>> terminate called after throwing an instance of 'std::bad_alloc' what():
>>
>
> This means you ran out of memory.
>
> You're attempting to add 239 million term postings to a single document.
> Document objects are built up in memory, and internally that is a C++
> std::map container, with an entry for each unique term. So what you're
> doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> associative array.
>
>
>> Is there another way to index this 3.3G file? It works well on smaller
>> files. I am testing some extreme cases. Thank you very much!
>>
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read. The collection frequency of each term
> at the end will be the total number of times it appeared.
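A rough, untested sketch of that every-N-lines idea, assuming the Search::Xapian
Perl bindings, a FILE handle opened as in the code above, and a made-up database
path and batch size:

    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new(
        'bigrams.db', Search::Xapian::DB_CREATE_OR_OPEN);
    my $doc = Search::Xapian::Document->new();
    my $batch = 100_000;   # bigrams per document; tune to available memory
    my $w_position = 0;

    while (my $line = <FILE>) {
        chomp $line;
        $doc->add_posting($line, ++$w_position);
        if ($w_position >= $batch) {
            # Flush this batch and start a fresh, empty document.
            $db->add_document($doc);
            $doc = Search::Xapian::Document->new();
            $w_position = 0;
        }
    }
    $db->add_document($doc) if $w_position;   # flush the last partial batch
    close FILE;

    # Afterwards the collection frequency is the total count, e.g.:
    # print $db->get_collection_freq('last<>year'), "\n";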
>
> Personally, I'd not use Xapian for that, but just use Perl hashes. Your data
> is probably too large to process in one go, but you can make multiple runs
> over subsets of the bigrams. A simple way would be to partition by the first
> byte, and run once for each possible first byte - something like this (totally
> untested) code:
>
> foreach my $first_byte (0 .. 255) {
>     my %frequency = ();
>     seek FILE, 0, 0 or die $!;
>     while (my $line = <FILE>)
>     {
>         chop ($line);
>         ++$frequency{$line} if ord($line) == $first_byte;
>     }
>     foreach my $bigram (sort keys %frequency) {
>         print "$frequency{$bigram}\t$bigram\n";
>     }
> }
>
> If this is going to get run a lot, you probably want to partition on a
> hashed version of $line to get a more even split so you can make fewer
> passes.
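One possible (untested) shape for that hashed partitioning, using Digest::MD5
purely as an illustrative hash; the pass count of 16 is arbitrary:

    use Digest::MD5 qw(md5);

    my $passes = 16;   # arbitrary; pick so each partition's hash fits in RAM
    foreach my $part (0 .. $passes - 1) {
        my %frequency = ();
        seek FILE, 0, 0 or die $!;
        while (my $line = <FILE>) {
            chomp $line;
            # Bucket on a hash of the whole bigram rather than its first
            # byte, so the partitions come out roughly equal in size.
            next unless unpack("N", md5($line)) % $passes == $part;
            ++$frequency{$line};
        }
        foreach my $bigram (sort keys %frequency) {
            print "$frequency{$bigram}\t$bigram\n";
        }
    }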
>
> Cheers,
> Olly
>