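I am working with two large datasets (300 x 500,000). Each file holds a matrix whose values are 0, 1, 2 or NA. I want to compare the two files, count how many values match in each row, and write the results to an output table.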
File 1
2 1 0
0 1 1
1 0 NA
File 2
2 1 0
Na 1 1
1 NA 0
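How can I compare the files to get the count of matching values in each row, and the total?
Answer 0 (score: 0)
I have used my own interpretation of what you mean by "total", and I simply dump the number of matching lines, but this should do what you asked and you should be able to adapt it to your exact specification.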
#!/usr/bin/perl
#
use Data::Dumper;
use strict;
use warnings;
# open files for reading (three-argument open) with error checking
open(my $f1, '<', "file1") or die "$! file1";
open(my $f2, '<', "file2") or die "$! file2";
#hash to store count of similar rows in
my %match_count=();
#total sum
my $total=0;
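#read a line from each file in step; testing with defined() ends the loop
#cleanly at end of file instead of calling lc() on undef
while (defined(my $l1 = <$f1>)) {
    my $l2 = <$f2>;
    last unless defined $l2;    # file2 has fewer lines than file1
    #lower case both lines to ignore the Na/NA difference
    $l1 = lc($l1);
    $l2 = lc($l2);
    #chomp to remove \n so this isn't stored
    chomp($l1);
    chomp($l2);
    #see if lines are the same
    if ($l1 eq $l2) {
        #increment counter for this line
        $match_count{$l1}++;
        #sum the numeric fields of the row (any number of columns,
        #skipping "na") and add to total
        $total += $_ for grep { /^\d+$/ } split(' ', $l1);
    }
}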
print "sum total of matches = $total\n";
print Dumper(\%match_count);
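
The script above only counts rows that are identical as a whole. If what you want is the number of matching cells in each row, something along these lines might be closer. This is only a sketch: the helper names `count_row_matches` and `compare_handles` are made up for illustration, and it assumes that a cell which is NA in either file never counts as a match (including NA against NA), which may not be what you need. It reads the sample data from the question through in-memory filehandles, so swap those for your real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions where two whitespace-separated rows hold the same value.
# Assumption: a cell that is NA (in any case) in either row is never a match.
sub count_row_matches {
    my ($row1, $row2) = @_;
    my @a = split ' ', lc $row1;
    my @b = split ' ', lc $row2;
    # compare only the columns that both rows have
    my $n = scalar(@a) < scalar(@b) ? scalar(@a) : scalar(@b);
    my $matches = 0;
    for my $i (0 .. $n - 1) {
        next if $a[$i] eq 'na' or $b[$i] eq 'na';
        $matches++ if $a[$i] eq $b[$i];
    }
    return $matches;
}

# Read two open filehandles in step and return the per-row match counts.
# Stops as soon as either handle runs out of lines.
sub compare_handles {
    my ($fh1, $fh2) = @_;
    my @counts;
    while (defined(my $l1 = <$fh1>)) {
        my $l2 = <$fh2>;
        last unless defined $l2;
        push @counts, count_row_matches($l1, $l2);
    }
    return @counts;
}

# Demo on the sample data from the question (in-memory filehandles).
my $data1 = "2 1 0\n0 1 1\n1 0 NA\n";
my $data2 = "2 1 0\nNa 1 1\n1 NA 0\n";
open(my $fh1, '<', \$data1) or die "$! data1";
open(my $fh2, '<', \$data2) or die "$! data2";

my @counts = compare_handles($fh1, $fh2);
my $total  = 0;
my $row    = 0;
for my $c (@counts) {
    $row++;
    $total += $c;
    printf "row %d: %d matches\n", $row, $c;
}
print "total matches = $total\n";
```

On the sample data this prints 3, 2 and 1 matches for the three rows and a total of 6. Because it reads one line from each file at a time, memory use stays small even with 500,000 values per row.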