From:
"Peter Pöschl" <address deleted>
Subject: Re: Counting symbols from a debian symbols file
Date: Mon, 19 Oct 2009 10:37:58 +0200
Subject: Re: Counting symbols from a debian symbols file
Date: Mon, 19 Oct 2009 10:37:58 +0200
Hi, > # check the rest of the command line arguments for matches against symbols. > Omega(n^2), sigh. > foreach my $file (@ARGV){ > foreach my $sym (keys %count){ > my $string=read_file($file); > if ($string =~ m/\b$sym\b/){ > $count{$sym}++; > } > } >} > my $string=read_file($file); Why do you do this again and again for every symbol? This should be moved in front of 'foreach my $sym'. For searching multiple symbols at once I would use this: # construct regex which can match every key in %count my $rx1_sym = join '|', reverse sort keys %count; $rx1_sym = qr{\b($rx1_sym)\b}msx; foreach my $file (@ARGV){ my $string=read_file($file); while ( $string =~ m{$rx1_sym}g ) { ++ $count{$1}; } } Best regards, Peter Poeschl
Date:
Mon, 19 Oct 2009 08:19:57 -0300
From: "David Bremner" <address deleted>
Subject: Re: Counting symbols from a debian symbols file
From: "David Bremner" <address deleted>
Subject: Re: Counting symbols from a debian symbols file
Peter Poeschl wrote: >> my $string=read_file($file); >Why do you do this again and again for every symbol? >This should be moved in front of 'foreach my $sym'. Thanks, that was really silly, and the script is now much faster. > # construct regex which can match every key in %count > my $rx1_sym = join '|', reverse sort keys %count; > $rx1_sym = qr{\b($rx1_sym)\b}msx; > foreach my $file (@ARGV){ > my $string=read_file($file); > while ( $string =~ m{$rx1_sym}g ) { > ++ $count{$1}; > } > } This seems to count the total number of times a symbol occurs, rather than the total number of files it occurs in. The latter is what I am interested in. In fact, it probably suffices to fine list of those symbols that occur only once. I think your code could be modified to use a hash storing where the symbol was first found, and maybe an auxilary hash marking those found more than once. Whether it is worthwhile depends on how enormous the symbol files in question are I guess. In my unscientific experiments, after modifying the script to only read each file once (which sped things up by a factor of 3 or so), the other modification yields something like a 3% or 4% improvement. Thanks again for your suggestions. David