Perl: fastest way to count with big data (300M+ rows)

Date: 2015-07-17 16:53:59

Tags: perl loops count

I have a dataset:

domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications

I want to count the number of IPs for each domain, per organization, which would give me this result:

domain,countip,org
anaplan.com, 2, Level 3 Communications
emileaben.com, 1, Level 3 Communications
anaplan.com, 1, abc

Can anyone help?

4 Answers:

Answer 0 (score: 3):

From the command line, without sorting:

perl -F, -ane'
  BEGIN { $" = "," }
  $. >1 or next;
  $h{"@F[0,2]"}++;
  END { print $k =~ s|,\K| $v,|r while ($k,$v) = each(%h)  }
' file
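
For readers less used to the flag soup, the unsorted one-liner corresponds roughly to this standalone script (a sketch for readability; -F, and -a provide the split, -n the input loop):

use strict;
use warnings;

$" = ",";                      # list separator used when a slice is interpolated
my %h;
while (<>) {
    next if $. == 1;           # skip the header row
    my @F = split /,/;         # -F, : split each line on commas
    $h{"@F[0,2]"}++;           # key is "domain, org" (field 2 keeps its newline)
}
while ( my ($k, $v) = each %h ) {
    # splice the count in right after the first comma of the key
    print $k =~ s|,\K| $v,|r;
}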

With sorting (descending by count):

perl -F, -ane'
  BEGIN { $" = "," }
  $. >1 or next;
  $h{"@F[0,2]"}++;
  END { print s|,\K| $h{$_},|r for sort {$h{$b} <=> $h{$a}} keys %h  }
' file

Answer 1 (score: 0):

You can use this command line (NR>1 skips the header row):

awk -F, 'NR>1 {print $1, $3}' FileName | sort | uniq -c
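
The count from uniq -c lands at the front of each line ("2 anaplan.com, Level 3 Communications") rather than in the middle column the question asks for; if the exact layout matters, a small sed step can move it (a sketch):

awk -F, 'NR>1 {print $1 "," $3}' FileName | sort | uniq -c | sed -E 's/^ *([0-9]+) ([^,]+),/\2, \1,/'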

If the file is too big to handle comfortably, you can split it up on one of the fields, like this:

awk -F', *' 'NR>1 {gsub(" ", "_", $3); print $0 >> $3; close($3); }' FileName

I picked the 3rd field on the assumption that the number of organizations is relatively small compared to the number of domains. The command replaces spaces with '_' so that each org maps to a usable file name. Once you have the counts from the smaller files, they are easy to combine.
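
For example, the count for a single org file could then be taken like this (a sketch; after awk rewrites $3, the rebuilt rows in the split files are space-separated, and sort -u drops duplicate domain/ip rows before counting):

sort -u Level_3_Communications | awk '{print $1}' | uniq -c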

Edit: the relative performance of the solutions depends on the ratio:

number of unique keys (combinations of organization and domain) / Total number of lines in the file

If this ratio is very small, counting with a hash is better and uses less memory. If the ratio is large, sorting is better.
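
If you do go the sorting route on a 300M+ row file, note that GNU sort already spills to temporary files rather than holding everything in RAM, and you can tune that explicitly (a sketch; the -S buffer size, -T temp directory and --parallel flags assume GNU coreutils, and /bigtmp is a placeholder path):

awk -F, 'NR>1 {print $1 "," $3}' FileName | sort -S 2G -T /bigtmp --parallel=4 | uniq -c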

Answer 2 (score: 0):

data.csv

domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications

count.pl

#!/usr/bin/env perl

use warnings;
use strict;

my $results = {};
while (<>) {
    next if $. == 1; # Skip Header
    chomp;
    my ($domain, $ip, $org) = split /,\s*/;
    $results->{$domain}->{$org}->{$ip} = 1;  # Ignore duplicates
}

print "domain,ip,org\n";
for my $domain (sort keys %$results) {
    for my $org (sort keys %{ $results->{$domain} }) {
        my $ips_per_domain = scalar keys %{ $results->{$domain}->{$org} };
        print join(', ', $domain, $ips_per_domain, $org) . "\n";
    }
}

./count.pl data.csv

domain,countip,org
anaplan.com, 2, Level 3 Communications
anaplan.com, 1, abc
emileaben.com, 1, Level 3 Communications

Answer 3 (score: 0):

A slightly different solution that may take longer in total processing time, but is easier on memory. It works by first grouping the input file by organization, and then computing the IP counts from those smaller "org" files:

use strict;
use warnings;

use Data::Dumper; 
use feature qw/say/;

my %fhs_by_organization;
while ( my $row = <> ) {
    next if $. == 1; # Skip Header
    chomp($row);

    my ($domain, $ip, $org) = split(/,\s*/, $row);
    unless ( exists $fhs_by_organization{$org} ) {
       my $outfilename = join('_', split(/\s+/, $org)) . '.txt';
       open my $fh, '>', $outfilename
          or die "$!";
       $fhs_by_organization{$org} = $fh;
    }
    say { $fhs_by_organization{$org} } "$domain, $ip";
}

# close resources
close($_) foreach values %fhs_by_organization;

# read each org file back in separately to reduce memory load;
# each child streams its counts back through a pipe, because a
# forked child cannot update a hash in the parent's memory
my %ipcount_by_org_domain;
my @result_pipes;
foreach my $org ( keys %fhs_by_organization ) {
    my $pid = open(my $pipe, '-|');   # fork; the child's STDOUT feeds $pipe
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) { # child code
        my $infilename = join('_', split(/\s+/, $org)) . '.txt';
        open my $fh, '<', $infilename
            or die "$!";

        my %seen_domainips;
        my %count_by_domain;
        while ( my $org_row = <$fh> ) {
            chomp $org_row;
            my ($domain, $ip) = split /,\s*/, $org_row;

            # only count unique ips
            next if $seen_domainips{$domain . $ip};

            $count_by_domain{$domain}++;
            $seen_domainips{$domain . $ip} = 1;
        }
        close $fh;
        unlink $infilename;

        # one tab-separated record per domain for the parent to collect
        print "$org\t$_\t$count_by_domain{$_}\n" for keys %count_by_domain;
        exit;
    }

    push @result_pipes, $pipe;
}

# collect the children's results; closing a '-|' handle also waits
# for (and reaps) the corresponding child
foreach my $pipe ( @result_pipes ) {
    while ( my $line = <$pipe> ) {
        chomp $line;
        my ($org, $domain, $count) = split /\t/, $line;
        $ipcount_by_org_domain{$org}{$domain} = 0 + $count;  # numify
    }
    close $pipe;
}

# dump results
$Data::Dumper::Terse = 1;
print Dumper \%ipcount_by_org_domain;

Also: the organizations are unique, so you can use fork to process them in parallel. It took about 19 seconds to process a 3.5M-row file on my machine.

Output:

{
  'Level 3 Communications' => {
                                'emileaben.com' => 1,
                                'anaplan.com' => 2
                              },
  'abc' => {
             'anaplan.com' => 1
           }
}