awk to match a keyword and check a sub-pattern in another field

Time: 2017-07-08 19:44:51

Tags: awk

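In the awk below, I am trying to print the entire line, along with the header row, if $3 is SNV, MNV, or INDEL. If that condition is true, the sub-pattern GMAF= should be found in $4 and the value after the = sign checked. If that value is less than or equal to .01, the entire line should be printed along with the header row.

Since $4 may be null or empty when $3 is SNV, I am not sure how to capture that case. Line 2 is an example. The assumption is that if there is no value in $4, it is the same as zero, so the line may be significant and should be extracted. I am also not sure how to include the header row in the output without the leading #. The ---- markers are not part of the file; they are only there to point out the header. I also added a comment to each line. Thank you :).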

file (tab-delimited)

##.....
##.....
#ID Name    Func    List     ---- header row ----
1   1   REF 
2   2   SNV 
3   3   SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014

desired output (tab-delimited)

ID  Name    Func    List
2   2   SNV 
3   3   SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014

AWK

awk -F'\t' -v OFS='\t' 'NR>3   # define FS and OFS as tab and look in 3 row of file  
        $3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL"{   # start block and look in row 3 in`$2` for any of these words
        sub(/:GMAF=*/,"",$4);  # if found then search `$4` for `:GMAF=`
        VAL=substr($4,RSTART+4,RLENGTH-4);   3 extract the 4 digits after the = sign as VAL
                                             }  # close block
            for(i=1;i<=num;i++){   # create a loop to iterate over each line as i
                    if(VAL[i] <= 0.01){  3 check each VAL in iand if less then or equal to 0.01
                    {  # start block
                                    print $1, $2, $3, VAL;  # print output
                                      }  # end block
                next   # process next line
                }  # end block
                1' file

Edit by Ed Morton, only to make the code above easier to read:

awk -F'\t' -v OFS='\t' '                           # define FS and OFS as tab
    NR>3                                           # and look in 3 row of file

    $3 == "SNV" || $3 == "MNV" || $3 == "INDEL" {  # start block and look in row 3 in`$2` for any of these words
        sub(/:GMAF=*/,"",$4);                      # if found then search `$4` for `:GMAF=`
        VAL=substr($4,RSTART+4,RLENGTH-4);         3 extract the 4 digits after the = sign as VAL
    }                                              # close block

    for(i=1;i<=num;i++) {                          # create a loop to iterate over each line as i
        if(VAL[i] <= 0.01) {                       3 check each VAL in iand if less then or equal to 0.01
            {                                      # start block
                print $1, $2, $3, VAL;             # print output
            }                                      # end block
            next                                   # process next line
        }                                          # end block
1' file

1 answer:

Answer 0 (score: 2)

Short answer:

To catch the case where $4 is unset, blank, or missing, check whether awk's total number of fields is 3 (NF==3).
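One caveat, sketched under assumptions: NF==3 only detects a missing $4 when the line really has three fields. With -F'\t', a line that ends in a trailing tab has NF==4 and an empty $4, so testing $4 == "" covers both cases. The helper name count_fields is hypothetical and the two input lines are made up for illustration.

```shell
# Hypothetical helper: for each tab-delimited line, report the field
# count and whether $4 is empty (1) or not (0).
count_fields() {
    awk -F'\t' '{ printf "%s NF=%d empty4=%d\n", $1, NF, ($4 == "") }'
}

# First line has three fields; second line ends in a trailing tab.
printf 'a\tb\tSNV\nc\td\tSNV\t\n' | count_fields
# a NF=3 empty4=1
# c NF=4 empty4=1
```

In both lines $4 == "" is true, while NF==3 is true only for the first.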

To remove the # in front of the header row, you can use any substitution technique (e.g. sub). I used gensub, a GNU awk extension, in my test.
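A rough sketch of the sub alternative, which should work in any POSIX awk rather than only GNU awk. The function name strip_header_hash is hypothetical, and the sketch assumes that metadata lines start with ## and that the header is the only line starting with a single #.

```shell
# Hypothetical helper: print only the header row, without its leading "#".
strip_header_hash() {
    awk '
        /^##/ { next }                   # skip "##" metadata lines
        /^#/  { sub(/^#/, ""); print }   # header: drop the leading "#" and print
    '
}

printf '##.....\n##.....\n#ID\tName\tFunc\tList\n1\t1\tREF\n' | strip_header_hash
```

Matching on the leading # instead of NR==3 avoids depending on the exact number of metadata lines.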

Full answer:
The code below seems to fit your needs. Although I did not use a tab-delimited file, you can adjust it for your file accordingly.

$ cat file4
##.....
##.....
#ID Name    Func    List
1   1   REF 
2   2   SNV 
3   3   SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
4   4   RNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
5   5   SNV AMAF=0.0041:EMAF=0.0:GMAF=0.14
6   6   INDEL
7   7   RNV
8   8   SNV GMAF=0.0041:EMAF=0.0:AMAF=0.0014
9   9   SNV EMAF=0.0041:GMAF=0.1:AMAF=0.0014

$ awk 'NR<3{next}NR==3{print gensub(/^#/,"","1");next}($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") && NF==3{print;next}
($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") {val=gensub(/.*GMAF=(.[^:]*).*/,"\\1","g",$4);if (val+0<=0.01) print}' file4
ID  Name    Func    List
2   2   SNV 
3   3   SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
6   6   INDEL
8   8   SNV GMAF=0.0041:EMAF=0.0:AMAF=0.0014

Explanation:

awk 'NR<3{next}                                                       # skip the first two lines
     NR==3{print gensub(/^#/,"","1");next}                            # print the third line (header) without the leading #
     ($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") && NF==3{print;next} # print the lines missing $4 and go to the next line
     ($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") {                    # if $3 fulfils the criteria then
        val=gensub(/.*GMAF=(.[^:]*).*/,"\\1","g",$4);                 # isolate the value of GMAF with a regex
        if (val+0<=0.01) print;                                       # val+0 forces a numeric comparison with the 0.01 threshold
        }' file4
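For a tab-delimited file and an awk without gensub, the same idea can be sketched with match and substr. This is a minimal sketch, not the answer's method: the function name filter_variants and the max variable are hypothetical, and it assumes that a line whose $4 is empty or has no GMAF= entry counts as GMAF 0 and is kept.

```shell
# Hypothetical helper: keep the header plus SNV/MNV/INDEL rows whose
# GMAF is at most the threshold (first argument, default 0.01).
filter_variants() {
    awk -F'\t' -v max="${1:-0.01}" '
        /^##/ { next }                              # skip "##" metadata lines
        /^#/  { sub(/^#/, ""); print; next }        # header without the leading "#"
        $3 != "SNV" && $3 != "MNV" && $3 != "INDEL" { next }
        {
            val = 0                                 # assumption: no GMAF means 0
            s = ":" $4                              # so a leading GMAF= also matches
            if (match(s, /:GMAF=[^:]*/))
                val = substr(s, RSTART + 6, RLENGTH - 6)   # text after ":GMAF="
            if (val + 0 <= max + 0) print           # numeric comparison
        }
    '
}

printf '##.....\n#ID\tName\tFunc\tList\n1\t1\tREF\t\n2\t2\tSNV\t\n3\t3\tSNV\tAMAF=0.0041:GMAF=0.0014\n5\t5\tSNV\tGMAF=0.14\n' | filter_variants
```

Usage on a real file would look like filter_variants 0.01 < file. Prefixing $4 with ":" lets a single pattern match GMAF= both at the start of the field and after a colon.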