How do I count occurrences of a string in a tab-separated values file?

Time: 2012-05-05 18:21:34

Tags: csv awk sed scripting

How do I count occurrences of strings in a tab-separated values (TSV) file?

The TSV file has hundreds of millions of rows, each of the form:
foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

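How do I count the occurrences of each unique integer in the second column across the whole file, ideally appending the count as a fifth value on each row?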

foobar1  1  xxx   yyy  1
foobar1  2  xxx   yyy  2
foobar2  2  xxx   yyy  2 
foobar2  3  xxx   yyy  2
foobar1  3  xxx   zzz  2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

2 answers:

Answer 0: (score: 1)

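It's not clear to me what you want to do. Do you want to append 0 or 1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, that is, the totals over the whole file?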

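In the first case, use something like

awk -F'\t' -v valueToCheck=2 '{ if ($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file

where valueToCheck is the value to test for (set to 2 here with -v purely as an example).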

In the second case, use something like

awk -F'\t' '{ h[$2] += 1 } END { for (val in h) print val ": " h[val] }' < file
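Neither command above appends the per-value total to each row, which is what the question's sample output shows. A minimal sketch of one way to do that with awk, assuming the input is a regular file that can be read twice (it will not work on a pipe). The function name append_counts is hypothetical and not part of the original answer:

```shell
#!/bin/sh
# append_counts FILE: hypothetical helper, not from the original answer.
# The same file is passed to awk twice:
#   pass 1 (NR == FNR) tallies each second-column value,
#   pass 2 prints every row with its tally appended as a fifth field.
append_counts() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { count[$2]++; next }   # first read: count only
        { print $0, count[$2] }           # second read: append the count
    ' "$1" "$1"
}
```

It could be run as `append_counts file > file.counted`. Memory use grows with the number of distinct second-column values, not with the number of rows, at the cost of reading the file twice.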

Answer 1: (score: 0)

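One solution using perl, assuming the file is sorted (or at least grouped) by the second column. By that I mean that once a value such as 2 is found, all lines with that value are consecutive. The script buffers lines until it finds a different value in the second column, then prints the buffered lines with their count and frees the memory, so it only ever holds one group of lines at a time, however large the input file is:

Content of script.pl: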

use warnings;
use strict;

my (@lines, $key);

## Print every saved line with the group size appended as last field,
## then clear the buffer.
sub flush_group {
    my $count = @lines;
    print "$_\t$count\n" for @lines;
    @lines = ();
}

while ( <> ) {

    ## Remove last '\n'.
    chomp;

    ## Split line on whitespace.
    my @f = split;

    ## Treat a line without exactly four fields as malformed and skip it.
    next unless @f == 4;

    ## The second column has changed, so print the lines saved so far.
    flush_group() if defined $key and $f[1] ne $key;

    ## Save lines until a different value is found in the second column.
    $key = $f[1];
    push @lines, $_;
}

## Print the last group. Without this, a final line whose value differs
## from the previous one would be lost.
flush_group();

Content of infile:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

Run it like:

perl script.pl infile

With the following output:

foobar1  1  xxx   yyy   1
foobar1  2  xxx   yyy   2
foobar2  2  xxx   yyy   2
foobar2  3  xxx   yyy   2
foobar1  3  xxx   zzz   2
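The same buffer-and-flush idea can also be sketched in awk, for readers who want to stay with the tools named in the question. This is an illustrative sketch under the same assumption as the perl script (rows with equal second-column values are consecutive). The names append_group_counts and emit_group are hypothetical:

```shell
#!/bin/sh
# append_group_counts: hypothetical helper that reads TSV rows on stdin.
# Assumes rows sharing a second-column value are consecutive.
append_group_counts() {
    awk -F'\t' -v OFS='\t' '
        # Print each buffered row with the group size appended, then reset.
        function emit_group(   i) {
            for (i = 1; i <= n; i++) print buf[i], n
            n = 0
        }
        NR > 1 && $2 != key { emit_group() }   # value changed: flush group
        { key = $2; buf[++n] = $0 }            # buffer the current row
        END { emit_group() }                   # flush the final group
    '
}
```

Because it reads standard input, it can sit at the end of a pipeline. If the file is not already grouped, something like `sort -t "$(printf '\t')" -k2,2n infile | append_group_counts` would group it first, with the caveat that sorting changes the row order and is itself expensive on hundreds of millions of rows.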