How do I count instances of a string in a tab-separated values (TSV) file?
The TSV file has hundreds of millions of lines, each of the form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
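How can I count the instances of each unique integer in the entire second column of the file, ideally appending the count as a fifth value on each line?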
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I would prefer a solution that uses only UNIX command-line stream-processing programs.
Answer 0 (score: 1)
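It's not clear to me what you want to do. Do you want to add a 0/1 fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, i.e. totals over the whole file?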
In the first case, use
awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file
replacing valueToCheck with the value to test for (or setting it with awk's -v option).
In the second case, use something like
awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file
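If the goal is the output shown in the question (the total appended to every line), the same counting idea can be extended to two passes over the file. This is only a sketch: the function name append_counts is hypothetical, and it assumes the input is a regular file that can be read twice (not a pipe) and that the number of distinct values in column 2 is small enough for the count array to fit in memory.

```shell
# Hypothetical helper: append, to every line, how often its column-2 value
# occurs in the whole file. The file is read twice, so it cannot be a pipe.
append_counts() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { count[$2]++; next }   # first pass: tally column 2
        { print $0, count[$2] }           # second pass: append the tally
    ' "$1" "$1"
}
```

It would be called as append_counts file > file_with_counts. Only one counter per distinct value is kept in memory, not the lines themselves.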
Answer 1 (score: 0)
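One solution using perl, assuming the values of the second column are sorted. By that I mean that once the value 2 is found, all lines with that same value will be consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so memory use depends on the largest run of equal values rather than on the size of the input file:
Contents of script.pl: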
use warnings;
use strict;

## Lines of the current group (same value in second column) and that value.
my (@lines, $key);

## Print the saved lines, appending how many there are as last field,
## and free the memory.
sub flush_lines {
    my $count = @lines;
    printf qq[%s\t%d\n], $_, $count for @lines;
    @lines = ();
}

while ( <> ) {

    ## Remove last '\n'.
    chomp;

    ## Split line on whitespace.
    my @f = split;

    ## Treat any line that doesn't have four fields as malformed and omit it.
    next unless @f == 4;

    ## The second field has changed, so print the lines saved so far
    ## and begin saving lines with the new value.
    flush_lines() if defined $key and $f[1] ne $key;

    $key = $f[1];
    push @lines, $_;
}

## Print the last group, which is still saved when the input ends.
flush_lines();
Contents of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
With the following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
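For comparison, the same buffer-until-the-value-changes idea can be sketched in awk, which keeps it within the command-line stream tools the question asks for. The names append_counts_sorted and emit_group are hypothetical, and the sketch makes the same assumption as the perl script: equal values in the second column are on consecutive lines. It reads standard input, so it also works in a pipeline.

```shell
# Hypothetical helper: for input grouped by column 2, append the size of
# each group to every line of that group. Reads from standard input.
append_counts_sorted() {
    awk -F'\t' -v OFS='\t' '
        # Print the buffered lines with the group size, then empty the buffer.
        function emit_group(   i) {
            for (i = 1; i <= n; i++) print buf[i], n
            n = 0
        }
        NR > 1 && $2 != key { emit_group() }   # value changed: flush the old group
        { key = $2; buf[++n] = $0 }            # buffer the current line
        END { emit_group() }                   # flush the final group
    '
}
```

It would be called as append_counts_sorted < infile. For input that is not already grouped, sorting on the second column first (for example with sort -t "$(printf '\t')" -k2,2n) would make equal values consecutive, at the cost of changing the original line order.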