How to process rows with a duplicated first column, subject to a condition

Time: 2011-09-26 17:16:52

Tags: perl shell awk

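If column 2 is empty in any row, I need to skip every row that has the same first column. For the remaining rows, I need to compute the ratio of column 4 to column 3 and append it to the row. How can I do this?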

Input:

T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

Output:

T245P   rr  8   1   0.125
T245P   rr  33  1   0.03030303
T226PA  fg  4   2   0.5
T226PA  g   51  38  0.745098039
T226PA  e   41  34  0.829268293

4 answers:

Answer 0 (score: 1)

Try:

awk 'NF < 4 {                         # column 2 is empty: drop this key entirely
         for (i in res) if (key[i] == $1) delete res[i]
         rm[$1]; next
     }
     $1 in rm { next }                # key was marked earlier
     { res[NR] = $0 "\t" $4/$3; key[NR] = $1 }
     END { for (i = 1; i <= NR; i++) if (i in res) print res[i] }' file

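This ignores every line with fewer than four fields, along with every line whose first column also appears on such a line. For all other lines it computes the ratio, appends it to the line, and stores the result in the array res. After the file has been processed, the entries of res are printed to stdout.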

Output:

T245P   rr  8   1       0.125
T245P   rr  33  1       0.030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098
T226PA  e   41  34          0.829268

HTH Chris
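
The one-pass idea in Answer 0 (buffer the rows, remember which keys had an empty column 2, print the survivors at the end) could also be sketched as a small shell function. This is only an illustration under stated assumptions: the name `ratio_one_pass` is made up, the input is assumed to be whitespace-separated (so an empty column 2 shows up as a row with fewer than four fields), and column 3 is assumed to be non-zero on every row that gets printed.

```shell
# Hypothetical helper (not from the thread): reads whitespace-separated rows on stdin.
ratio_one_pass() {
    awk '
        NF < 4 { bad[$1] }                # empty column 2: mark the key
        { row[NR] = $0; key[NR] = $1 }    # buffer every row in input order
        END {
            for (i = 1; i <= NR; i++)
                if (!(key[i] in bad)) {
                    split(row[i], f)      # re-split the buffered row into fields
                    print f[1], f[2], f[3], f[4], f[4] / f[3]
                }
        }
    '
}

# Usage with a few rows from the question:
ratio_one_pass <<'EOF'
T75PA       2   0
T75PA   kk  4   1
T245P   rr  8   1
T226PA  g   51  38
EOF
```

Because the keys are only checked in the END block, the row with the empty column 2 does not have to come first within its group, and the numeric loop keeps the rows in input order.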

Answer 1 (score: 1)

I'm assuming your data is tab-delimited. A Perl script like this (I haven't tested it)...

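use strict;
use warnings;

my @data;     # every record, in input order
my %counts;   # number of rows per first-column value
my %blanks;   # first-column values that had an empty column 2
while( my $line = <STDIN> )
{
    chomp($line);    # strip only the newline; chop would remove any last character
    my @rec = split( "\t", $line );
    push( @data, \@rec );
    $counts{$rec[0]}++;
    if( $rec[1] eq '' )
    {
        $blanks{$rec[0]}++;
    }
}
foreach my $rec ( @data )
{
    # keep a row if its key is unique or never appeared with an empty column 2
    if( $counts{$rec->[0]} <= 1 || !$blanks{$rec->[0]} )
    {
        print join( "\t", @$rec, $rec->[3] / $rec->[2] ) . "\n";
    }
}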

Answer 2 (score: 1)

How about:

#!/usr/bin/perl
use Modern::Perl;


my $re = qr/^([A-Z0-9]+)\s+?(\S+|\s+)\s+(\d+)\s+(\d+)\s*$/;
my $skip = '';
while (<DATA>) {
    chomp;
    if (my @l = $_ =~ /$re/) {
        if ($l[1] =~ /^\s+$/ || $skip eq $l[0]) {
            $skip = $l[0];
            next;
        }
        $skip = '';
        my $r = $l[3] / $l[2];
        say "$_\t$r";
    }
}

__DATA__
T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

Output:

T245P   rr  8   1       0.125
T245P   rr  33  1       0.0303030303030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098039215686
T226PA  e   41  34  0.829268292682927

Answer 3 (score: 1)

awk '
    NR==FNR {if (NF < 4) blank[$1]; next}
    $1 in blank {next}
    {$(NF+1) = $4/$3; print}
' datafile datafile | column -t

Since you now say the field separator is a tab:

awk '
    BEGIN {OFS = FS = "\t"}
    NR==FNR {if ($2 == "") blank[$1]; next}
    $1 in blank {next}
    {$5 = $4/$3; print}
' datafile datafile
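
Both commands pass the same file to awk twice. `NR==FNR` is true only while the first copy is being read, so the first pass collects the keys to skip and the second pass prints the surviving rows. A self-contained sketch of that two-pass pattern follows; the function name `ratio_two_pass` and the temporary file are made up for illustration, and the input is assumed to be tab-separated and non-empty, with a non-zero column 3 on every printed row.

```shell
# Hypothetical helper (not from the thread): takes the path of a tab-separated file.
ratio_two_pass() {
    awk '
        BEGIN { FS = OFS = "\t" }
        NR == FNR { if ($2 == "") blank[$1]; next }   # pass 1: collect keys to skip
        !($1 in blank) { print $0, $4 / $3 }          # pass 2: print survivors with ratio
    ' "$1" "$1"
}

# Usage with a few tab-separated rows from the question:
tmp=$(mktemp)
printf 'T240P\t\t4\t3\nT240P\ttest\t3\t3\nT245P\trr\t8\t1\n' > "$tmp"
ratio_two_pass "$tmp"
rm -f "$tmp"
```

Reading the file twice avoids buffering the rows in memory, at the cost of requiring a real file rather than a pipe.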