I have a dataset:
domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications
I want to count the number of IPs per domain, per organization, which would give me this result:
domain,countip,org
anaplan.com, 2, Level 3 Communications
emileaben.com, 1, Level 3 Communications
anaplan.com, 1, abc
Can anyone help?
Answer 0 (score: 3)
From the command line, without sorting:
perl -F, -ane'
BEGIN { $" = "," }
$. >1 or next;
$h{"@F[0,2]"}++;
END { print $k =~ s|,\K| $v,|r while ($k,$v) = each(%h) }
' file
With sorting:
perl -F, -ane'
BEGIN { $" = "," }
$. >1 or next;
$h{"@F[0,2]"}++;
END { print s|,\K| $h{$_},|r for sort {$h{$b} <=> $h{$a}} keys %h }
' file
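For readers less familiar with Perl one-liners, the same count-by-(domain, org) hash idea can be sketched in Python. This is purely an illustrative translation, not part of the original answer; the rows are inlined from the question's data.csv:

```python
from collections import Counter

# Inlined sample data from the question's CSV (header already dropped).
rows = [
    ("emileaben.com", "94.31.44.1", "Level 3 Communications"),
    ("anaplan.com", "94.31.44.12", "Level 3 Communications"),
    ("anaplan.com", "94.31.44.15", "abc"),
    ("anaplan.com", "94.31.44.19", "Level 3 Communications"),
]

# Mirror the %h hash in the one-liner: count lines per (domain, org) pair.
counts = Counter((domain, org) for domain, _ip, org in rows)

# Like the second one-liner, emit rows sorted by count, descending.
for (domain, org), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{domain}, {n}, {org}")
# prints:
# anaplan.com, 2, Level 3 Communications
# emileaben.com, 1, Level 3 Communications
# anaplan.com, 1, abc
```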
Answer 1 (score: 0)
You can use this command line:
awk -F, 'NR > 1 { print $1, $3 }' FileName | sort | uniq -c
If the file is too large to process comfortably, you can first split it on one of the fields, like this:
awk -F, 'NR > 1 { gsub(" ", "_", $3); print $0 >> $3; close($3) }' FileName
I chose the 3rd field on the assumption that the number of organizations is small relative to the number of domains. The gsub replaces spaces with '_' so the org name can be used as a file name. Once you have the counts from the smaller files, they are easy to combine.
Edit: the relative performance of the two solutions depends on the ratio:
number of unique keys (combinations of organization and domain) / Total number of lines in the file
If this ratio is very small, counting with a hash table is better and uses less memory. If the ratio is large, sorting is better.
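The split-then-count strategy above can also be sketched in Python, again only as an illustration: the in-memory `groups` dict stands in for the per-org files that the awk command writes to disk.

```python
from collections import Counter, defaultdict

# Sample rows from the question (header already dropped).
rows = [
    ("emileaben.com", "94.31.44.1", "Level 3 Communications"),
    ("anaplan.com", "94.31.44.12", "Level 3 Communications"),
    ("anaplan.com", "94.31.44.15", "abc"),
    ("anaplan.com", "94.31.44.19", "Level 3 Communications"),
]

# Phase 1: group rows by organization; in the awk version each group
# goes to its own file, named after the org with spaces replaced by '_'.
groups = defaultdict(list)
for domain, ip, org in rows:
    groups[org.replace(" ", "_")].append((domain, ip))

# Phase 2: count entries per domain within each (now much smaller) group.
counts = {org: Counter(d for d, _ip in pairs) for org, pairs in groups.items()}
```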
Answer 2 (score: 0)
data.csv
domain,ip,org
emileaben.com, 94.31.44.1, Level 3 Communications
anaplan.com, 94.31.44.12, Level 3 Communications
anaplan.com, 94.31.44.15, abc
anaplan.com, 94.31.44.19, Level 3 Communications
count.pl
#!/usr/bin/env perl
use warnings;
use strict;

my $results = {};
while (<>) {
    next if $. == 1;    # Skip Header
    chomp;
    my ($domain, $ip, $org) = split /,\s*/;
    $results->{$domain}->{$org}->{$ip} = 1;    # Ignore duplicates
}

print "domain,countip,org\n";
for my $domain (sort keys %$results) {
    for my $org (sort keys %{ $results->{$domain} }) {
        my $ips_per_domain = scalar keys %{ $results->{$domain}->{$org} };
        print join(', ', $domain, $ips_per_domain, $org) . "\n";
    }
}
cat data.csv | ./count.pl
domain,countip,org
anaplan.com, 2, Level 3 Communications
anaplan.com, 1, abc
emileaben.com, 1, Level 3 Communications
Answer 3 (score: 0)
A slightly different solution that may take longer to run overall, but is easier on memory. It works by first splitting the input into per-organization groups, and then computing the IP counts from those smaller "org" files:
use strict;
use warnings;
use Data::Dumper;
use feature qw/say/;
my %fhs_by_organization;
while ( my $row = <> ) {
    next if $. == 1;    # Skip Header
    chomp($row);
    my ($domain, $ip, $org) = split(/,\s*/, $row);
    unless ( exists $fhs_by_organization{$org} ) {
        my $outfilename = join('_', split(/\s+/, $org)) . '.txt';
        open my $fh, '>', $outfilename
            or die "$!";
        $fhs_by_organization{$org} = $fh;
    }
    say { $fhs_by_organization{$org} } "$domain, $ip";
}

# close resources
close($_) foreach values %fhs_by_organization;
# read each org file back in separately to reduce memory load;
# the counting must happen in this (parent) process: a forked child
# gets its own copy of %ipcount_by_org_domain, so its updates would
# be lost when it exits
my %ipcount_by_org_domain;
foreach my $org ( keys %fhs_by_organization ) {
    my $infilename = join('_', split(/\s+/, $org)) . '.txt';
    open my $fh, '<', $infilename
        or die "$!";
    my %seen_domainips;
    while ( my $org_row = <$fh> ) {
        my ($domain, $ip) = split /,\s*/, $org_row;
        # only count unique ips
        next if $seen_domainips{$domain . $ip};
        $ipcount_by_org_domain{$org}->{$domain}++;
        $seen_domainips{$domain . $ip} = 1;
    }
    close $fh;
    unlink $infilename;
}

# dump results
$Data::Dumper::Terse = 1;
print Dumper \%ipcount_by_org_domain;
Aside: the organizations are unique, so the per-org files could also be processed in parallel with fork; note that forked children do not share memory, so each child would have to send its counts back to the parent (e.g. over a pipe). Processing a 3.5M-line file takes about 19 seconds on my machine.
Output:
{
  'Level 3 Communications' => {
    'emileaben.com' => 1,
    'anaplan.com' => 2
  },
  'abc' => {
    'anaplan.com' => 1
  }
}
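As the aside notes, the per-org groups are independent, so they could be counted in parallel, but each worker then has to hand its counts back to the parent. A minimal sketch of that gather-results pattern, purely illustrative: it uses a Python thread pool and an in-memory stand-in for the per-org files (a real fork-based version would instead send the counts back over a pipe):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the per-organization files written in the first pass.
ORG_ROWS = {
    "Level 3 Communications": [
        ("emileaben.com", "94.31.44.1"),
        ("anaplan.com", "94.31.44.12"),
        ("anaplan.com", "94.31.44.19"),
    ],
    "abc": [("anaplan.com", "94.31.44.15")],
}

def count_org(org):
    # Count unique (domain, ip) pairs per domain for one organization,
    # and RETURN the result so the parent can collect it.
    unique_pairs = set(ORG_ROWS[org])
    return org, Counter(domain for domain, _ip in unique_pairs)

# The parent gathers each worker's counts into one result dict.
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(count_org, ORG_ROWS))
```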