Question

How do I count instances of strings in a tab-separated value (TSV) file?

The TSV file has hundreds of millions of lines, each of the form:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

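How can I count the instances of each unique integer in the second column across the entire file, and ideally append that count as a fifth value on each line?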

foobar1  1  xxx   yyy  1
foobar1  2  xxx   yyy  2
foobar2  2  xxx   yyy  2 
foobar2  3  xxx   yyy  2
foobar1  3  xxx   zzz  2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

Answer 1

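It isn't clear to me what you want to do. Do you want to append a 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, i.e. totals for the whole file?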

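In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file, where valueToCheck is a placeholder for the value to test against (it can be passed in with awk's -v option).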
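For the first case, here is a minimal sketch that passes the value in from the shell instead of leaving it as a placeholder. The function name flag_col2 and its argument order are assumptions made for illustration; only awk's standard -F and -v options are used.

```shell
# flag_col2 VALUE FILE  (hypothetical helper, not part of the original answer)
# Appends 1 to each row whose second column equals VALUE, and 0 otherwise.
flag_col2() {
    awk -F'\t' -v OFS='\t' -v valueToCheck="$1" \
        '{ print $0, ($2 == valueToCheck ? 1 : 0) }' "$2"
}
```

Called as flag_col2 2 file, this would mark every row whose second column is 2 and leave the original four columns untouched.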

In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
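If the goal is the per-row count that the question asks for, one possible sketch reads the file twice: a first pass builds the table of counts, and a second pass prints each row with its count appended. This assumes the input is a regular file that can be read twice (not a pipe) and that the set of distinct second-column values fits in memory. The name append_counts is hypothetical.

```shell
# append_counts FILE  (hypothetical helper, not part of the original answer)
# Appends to each row the number of times its second-column value occurs
# in the whole file. The file is named twice so that awk reads it twice.
append_counts() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { h[$2]++; next }   # first pass: count each value
        { print $0, h[$2] }           # second pass: append the count
    ' "$1" "$1"
}
```

Unlike a single-pass approach, this does not require the second column to be sorted, at the cost of reading the input twice.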

Answer 2

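One solution using perl, assuming the values of the second column are sorted; by that I mean that once the value 2 is found, all lines with that same value are consecutive. The script holds lines until it finds a different value in the second column, then gets the count, prints them and frees the memory, so memory use depends on the largest run of equal values rather than on the size of the input file: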

Contents of script.pl:

use warnings;
use strict;

my (%lines, $count);

while ( <> ) { 

    ## Remove last '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat a line without exactly four fields as malformed and skip it.
    next unless @f == 4;

    ## Save lines in a hash until a different value is found in the second
    ## column. An empty hash means this is the first valid line.
    ## The hash only ever holds one key at a time.
    if ( exists $lines{ $f[1] } or ! %lines ) { 
        push @{ $lines{ $f[1] } }, $_; 
        ++$count;
        next;
    }   

    ## At this point the second field has changed, so print the previous
    ## lines saved in the hash, remove them and begin saving lines with the
    ## new value.
    flush_group();

    ## Add current line to hash, initialize counter and repeat the whole
    ## process until end of file.
    push @{ $lines{ $f[1] } }, $_; 
    $count = 1;
}

## The last group is still held in the hash when the input ends, so print
## it here. Otherwise a final line with a new value would be lost.
flush_group();

sub flush_group {

    ## The value of the second column is the only key of the hash.
    my ($key) = keys %lines;
    return unless defined $key;

    ## Print each saved line, appending the number of lines in the group
    ## as last field.
    while ( @{ $lines{ $key } } ) { 
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }   

    ## Clear hash.
    %lines = (); 
}

Contents of infile:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

Run it like:

perl script.pl infile

With the following output:

foobar1  1  xxx   yyy   1
foobar1  2  xxx   yyy   2
foobar2  2  xxx   yyy   2
foobar2  3  xxx   yyy   2
foobar1  3  xxx   zzz   2
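The same buffering idea can also be sketched in awk, for readers who would rather stay with a single stream-processing tool. This is an illustration under the same assumption as the Perl script (equal second-column values are consecutive), and it additionally assumes tab-separated input. The function name append_group_counts is hypothetical.

```shell
# append_group_counts [FILE...]  (hypothetical helper)
# Buffers consecutive rows that share a second-column value and prints them
# with the group size appended once the value changes or the input ends.
append_group_counts() {
    awk -F'\t' -v OFS='\t' '
        # Print every buffered row followed by the size of its group.
        function emit(   i) {
            for (i = 1; i <= n; i++) print buf[i], n
            n = 0
        }
        n && $2 != key { emit() }      # value changed: flush the previous group
        { key = $2; buf[++n] = $0 }    # buffer the current row
        END { emit() }                 # flush the final group
    ' "$@"
}
```

If the file is not already grouped by the second column, it would have to be sorted on that column first, which changes the row order.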
