awk: output a file according to specific rules

Time: 2015-10-12 19:26:16

Tags: awk

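I am trying to use the awk below to make the output look like the desired output, and I am having some trouble with the syntax. The part I am stuck on is this: for the bases within a specific target that have fewer than 30 reads, output how many there are and calculate their average. Thank you :).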

awk '
    {N[$1]++
     T[$1]+=$4
     M[$1]=$2
    }
END     {for (X in N) printf ("%s is %d bases and maps to %s with an average depth"\
                            " of %f reads\n", X, N[X], M[X], T[X]/N[X]);
    }
'  input.txt > output.txt

Input

chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:955542-955763  AGRN:exon.1 3   0
chr1:955542-955763  AGRN:exon.1 4   1
chr1:955542-955763  AGRN:exon.1 5   1
chr1:955542-955763  AGRN:exon.1 6   1
chr1:955542-955763  AGRN:exon.1 7   1
chr1:955542-955763  AGRN:exon.1 8   1
chr1:955542-955763  AGRN:exon.1 9   1
chr1:955542-955763  AGRN:exon.1 10  1
chr1:955542-955763  AGRN:exon.1 11  32

Current output

chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.545455 reads

Desired output

chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.54 reads and there are 10 bases less than 30 reads with an average coverage of 0.63 reads

Edit (field descriptions)

awk '{for (i=1; i<=NF; i++) print i, $i}' input.txt

1 chr1:955542-955763 (defines the specific target location) - variable N
2 AGRN:exon.1  (defines the name/id of the target location) - variable M
3 1   (defines the exact base on the target)
4 0    (used to calculate the average) - variable T

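The first part of the output seems to work perfectly; what I cannot get right is the second part that has to be added to it, namely: and there are 10 bases less than 30 reads with an average coverage of 0.63 reads

Here 10 is the number of bases in `$2` with fewer than 30 reads, and 0.63 is the average of all the values in `$4`. I hope this helps, thank you :).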

2-D output

Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of **2.909091** reads  ( should be 32 - `$4` is 32 / 1)

1 answer:

Answer 0 (score: 1)

Updated answer

For the thresholded, 2-D type of output, I would switch to GNU awk, which supports true 2-D arrays:

gawk '
    {  i=1                 # use second index of 1 for $4 < 30
       if($4>=30)i=2       # use second index of 2 for $4 >= 30
       N[$1][i]++
       T[$1][i]+=$4
       B[$1][i]++
       M[$1][i]=$2
    }
    END {
       for (X in N){
          # Print a class only if it has bases, so an empty class cannot trigger a division by zero
          if (1 in N[X])
             printf ("Lo: %s is %d bases and maps to %s with an average depth"\
                            " of %f reads\n", X, N[X][1], M[X][1], T[X][1]/B[X][1]);
          if (2 in N[X])
             printf ("Hi: %s is %d bases and maps to %s with an average depth"\
                            " of %f reads\n", X, N[X][2], M[X][2], T[X][2]/B[X][2]);
       }
    }    ' input.txt

Output

Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of 32.000000 reads
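
If GNU awk is not available, the same Lo/Hi split can probably be done in any POSIX awk by using composite keys such as `count[$1, cls]`, which awk joins internally with `SUBSEP`. The following is only a sketch under that assumption: the function name `split_depth`, the variable names, and the `limit` parameter are made up for illustration and are not part of the question. The sample data is copied from the question's input.

```shell
# Hypothetical helper: classify each base as Lo (< limit) or Hi (>= limit)
# and report count and average depth per target and class.
split_depth() {
    awk -v limit="${1:-30}" '
    {
        cls = ($4 >= limit) ? "Hi" : "Lo"       # class of this base
        if (!($1 in name)) order[++n] = $1      # remember first-seen order of targets
        name[$1] = $2
        count[$1, cls]++                        # composite key, joined with SUBSEP
        total[$1, cls] += $4
    }
    END {
        for (k = 1; k <= n; k++) {
            t = order[k]
            for (c = 1; c <= 2; c++) {
                cls = (c == 1) ? "Lo" : "Hi"
                if (count[t, cls] > 0)          # skip empty classes, no division by zero
                    printf "%s: %s is %d bases and maps to %s with an average depth of %f reads\n",
                           cls, t, count[t, cls], name[t], total[t, cls] / count[t, cls]
            }
        }
    }'
}

# Sample data from the question (target, name, base, depth).
sample='chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:955542-955763  AGRN:exon.1 3   0
chr1:955542-955763  AGRN:exon.1 4   1
chr1:955542-955763  AGRN:exon.1 5   1
chr1:955542-955763  AGRN:exon.1 6   1
chr1:955542-955763  AGRN:exon.1 7   1
chr1:955542-955763  AGRN:exon.1 8   1
chr1:955542-955763  AGRN:exon.1 9   1
chr1:955542-955763  AGRN:exon.1 10  1
chr1:955542-955763  AGRN:exon.1 11  32'

result=$(printf '%s\n' "$sample" | split_depth 30)
printf '%s\n' "$result"
```

On the sample data this should print the same two Lo/Hi lines as the gawk version. Keeping the first-seen order in `order[]` is a design choice so the output does not depend on the unspecified iteration order of `for (X in ...)`.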

Original answer

I think you want something like this, which ignores lines where the last field is 30 or more:

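awk '
    $4 < 30 {
       N[$1]++          # number of bases below 30 reads for this target
       T[$1]+=$4        # summed depth of those bases
       M[$1]=$2         # name of the target
    }
    END {
       # divide by the count of bases, not by the last base position in $3
       for (X in N) printf ("%s is %d bases and maps to %s with an average depth"\
                            " of %f reads\n", X, N[X], M[X], T[X]/N[X]);
    } ' input.txt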
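
To get the single combined line from the question, one option is to keep the overall totals and the below-threshold totals side by side in one pass. This is a rough sketch, not a tested pipeline: `depth_summary`, `limit`, and the array names are hypothetical. Two differences from the desired output are deliberate. First, `%.2f` rounds, so 39/11 prints as 3.55 rather than 3.54. Second, the average over the 10 low bases is 7/10 = 0.70; the 0.63 in the question looks like 7/11, that is, the low sum divided by all 11 bases.

```shell
# Hypothetical helper: one summary line per target, combining the overall
# average depth with the count and average of bases below the limit.
depth_summary() {
    awk -v limit="${1:-30}" '
    {
        if (!($1 in n)) order[++targets] = $1   # remember first-seen order of targets
        n[$1]++                                 # all bases in the target
        sum[$1] += $4                           # summed depth of all bases
        name[$1] = $2
        if ($4 < limit) {                       # bases below the threshold
            low_n[$1]++
            low_sum[$1] += $4
        }
    }
    END {
        for (k = 1; k <= targets; k++) {
            t = order[k]
            low_avg = (low_n[t] > 0) ? low_sum[t] / low_n[t] : 0   # guard against an empty class
            printf "%s is %d bases and maps to %s with an average depth of %.2f reads and there are %d bases less than %d reads with an average coverage of %.2f reads\n",
                   t, n[t], name[t], sum[t] / n[t], low_n[t], limit, low_avg
        }
    }'
}

# Sample data from the question (target, name, base, depth).
sample='chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:955542-955763  AGRN:exon.1 3   0
chr1:955542-955763  AGRN:exon.1 4   1
chr1:955542-955763  AGRN:exon.1 5   1
chr1:955542-955763  AGRN:exon.1 6   1
chr1:955542-955763  AGRN:exon.1 7   1
chr1:955542-955763  AGRN:exon.1 8   1
chr1:955542-955763  AGRN:exon.1 9   1
chr1:955542-955763  AGRN:exon.1 10  1
chr1:955542-955763  AGRN:exon.1 11  32'

summary=$(printf '%s\n' "$sample" | depth_summary 30)
printf '%s\n' "$summary"
```

If the 0.63 figure really is what is wanted, dividing `low_sum[t]` by `n[t]` instead of `low_n[t]` would give 7/11, although that is no longer the average coverage of the low bases alone.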