Perl: fastest way to count with big data (300M+ rows)

Date: 2015-07-17 16:53:59

Tags: perl loops count

I have a dataset:

domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications

I want to count the number of IPs for each domain, per organization, which would give me this result:

domain,countip,org
anaplan.com, 2, Level 3 Communications
emileaben.com, 1, Level 3 Communications
anaplan.com, 1, abc

Can anyone help?

4 Answers:

Answer 0 (score: 3):

From the command line, without sorting:

perl -F, -ane'
  BEGIN { $" = "," }
  $. >1 or next;
  $h{"@F[0,2]"}++;
  END { print $k =~ s|,\K| $v,|r while ($k,$v) = each(%h)  }
' file
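
For readers less used to the flag soup, the unsorted one-liner corresponds roughly to this standalone script (a sketch for readability; -F, and -a provide the split, -n the input loop):

use strict;
use warnings;

$" = ",";                      # list separator used when a slice is interpolated
my %h;
while (<>) {
    next if $. == 1;           # skip the header row
    my @F = split /,/;         # -F, : split each line on commas
    $h{"@F[0,2]"}++;           # key is "domain, org" (field 2 keeps its newline)
}
while ( my ($k, $v) = each %h ) {
    # splice the count in right after the first comma of the key
    print $k =~ s|,\K| $v,|r;
}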

With sorting (descending by count):

perl -F, -ane'
  BEGIN { $" = "," }
  $. >1 or next;
  $h{"@F[0,2]"}++;
  END { print s|,\K| $h{$_},|r for sort {$h{$b} <=> $h{$a}} keys %h  }
' file

Answer 1 (score: 0):

You can use this command line (NR>1 skips the header row):

awk -F, 'NR>1 {print $1, $3}' FileName | sort | uniq -c
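
The count from uniq -c lands at the front of each line ("2 anaplan.com, Level 3 Communications") rather than in the middle column the question asks for; if the exact layout matters, a small sed step can move it (a sketch):

awk -F, 'NR>1 {print $1 "," $3}' FileName | sort | uniq -c | sed -E 's/^ *([0-9]+) ([^,]+),/\2, \1,/'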

If the file is too big to handle comfortably, you can split it up on one of the fields, like this:

awk -F', *' 'NR>1 {gsub(" ", "_", $3); print $0 >> $3; close($3); }' FileName

I picked the 3rd field on the assumption that the number of organizations is relatively small compared to the number of domains. The command replaces spaces with '_' so that each org maps to a usable file name. Once you have the counts from the smaller files, they are easy to combine.
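
For example, the count for a single org file could then be taken like this (a sketch; after awk rewrites $3, the rebuilt rows in the split files are space-separated, and sort -u drops duplicate domain/ip rows before counting):

sort -u Level_3_Communications | awk '{print $1}' | uniq -c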

Edit: the relative performance of the solutions depends on the ratio:

number of unique keys (combinations of organization and domain) / Total number of lines in the file

If this ratio is very small, counting with a hash is better and uses less memory. If the ratio is large, sorting is better.
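
If you do go the sorting route on a 300M+ row file, note that GNU sort already spills to temporary files rather than holding everything in RAM, and you can tune that explicitly (a sketch; the -S buffer size, -T temp directory and --parallel flags assume GNU coreutils, and /bigtmp is a placeholder path):

awk -F, 'NR>1 {print $1 "," $3}' FileName | sort -S 2G -T /bigtmp --parallel=4 | uniq -c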

Answer 2 (score: 0):

data.csv

domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications

count.pl

#!/usr/bin/env perl

use warnings;
use strict;

my $results = {};
while (<>) {
    next if $. == 1; # Skip Header
    chomp;
    my ($domain, $ip, $org) = split /,\s*/;
    $results->{$domain}->{$org}->{$ip} = 1;  # Ignore duplicates
}

print "domain,ip,org\n";
for my $domain (sort keys %$results) {
    for my $org (sort keys %{ $results->{$domain} }) {
        my $ips_per_domain = scalar keys %{ $results->{$domain}->{$org} };
        print join(', ', $domain, $ips_per_domain, $org) . "\n";
    }
}

./count.pl data.csv

domain,countip,org
anaplan.com, 2, Level 3 Communications
anaplan.com, 1, abc
emileaben.com, 1, Level 3 Communications

Answer 3 (score: 0):

A slightly different solution that may take longer in total processing time, but is easier on memory. It works by first grouping the input file by organization, and then computing the IP counts from those smaller "org" files:

use strict;
use warnings;

use Data::Dumper; 
use feature qw/say/;

my %fhs_by_organization;
while ( my $row = <> ) {
    next if $. == 1; # Skip Header
    chomp($row);

    my ($domain, $ip, $org) = split(/,\s*/, $row);
    unless ( exists $fhs_by_organization{$org} ) {
       my $outfilename = join('_', split(/\s+/, $org)) . '.txt';
       open my $fh, '>', $outfilename
          or die "$!";
       $fhs_by_organization{$org} = $fh;
    }
    say { $fhs_by_organization{$org} } "$domain, $ip";
}

# close resources
close($_) foreach values %fhs_by_organization;

# read each org file back in separately to reduce memory load;
# each child streams its counts back through a pipe, because a
# forked child cannot update a hash in the parent's memory
my %ipcount_by_org_domain;
my @result_pipes;
foreach my $org ( keys %fhs_by_organization ) {
    my $pid = open(my $pipe, '-|');   # fork; the child's STDOUT feeds $pipe
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) { # child code
        my $infilename = join('_', split(/\s+/, $org)) . '.txt';
        open my $fh, '<', $infilename
            or die "$!";

        my %seen_domainips;
        my %count_by_domain;
        while ( my $org_row = <$fh> ) {
            chomp $org_row;
            my ($domain, $ip) = split /,\s*/, $org_row;

            # only count unique ips
            next if $seen_domainips{$domain . $ip};

            $count_by_domain{$domain}++;
            $seen_domainips{$domain . $ip} = 1;
        }
        close $fh;
        unlink $infilename;

        # one tab-separated record per domain for the parent to collect
        print "$org\t$_\t$count_by_domain{$_}\n" for keys %count_by_domain;
        exit;
    }

    push @result_pipes, $pipe;
}

# collect the children's results; closing a '-|' handle also waits
# for (and reaps) the corresponding child
foreach my $pipe ( @result_pipes ) {
    while ( my $line = <$pipe> ) {
        chomp $line;
        my ($org, $domain, $count) = split /\t/, $line;
        $ipcount_by_org_domain{$org}{$domain} = 0 + $count;  # numify
    }
    close $pipe;
}

# dump results
$Data::Dumper::Terse = 1;
print Dumper \%ipcount_by_org_domain;

Also: the organizations are unique, so you can use fork to process them in parallel. It took about 19 seconds to process a 3.5M-row file on my machine.

Output:

{
  'Level 3 Communications' => {
                                'emileaben.com' => 1,
                                'anaplan.com' => 2
                              },
  'abc' => {
             'anaplan.com' => 1
           }
}