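In the awk below, if $3 is SNV, MNV or INDEL, I am trying to print the entire line along with the header row. When that condition is true, the script should look in $4 for the sub-pattern :GMAF= and check the value after the = sign. If that value is less than or equal to .01, it should print the entire line along with the header row.

Since $4 may be empty or null when $3 is SNV, I am not sure how to capture that case. Line 2 is an example. I am assuming that if there is no value in $4, it is the same as zero, so the line may be significant and should be extracted. I am also not sure how to include the header row, minus the #, in the printed output. The ---- markers are not part of the file; they are only there to indicate the header. I have also added a comment to each line. Thank you :).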
File (tab-delimited)
##.....
##.....
#ID Name Func List ---- header row ----
1 1 REF
2 2 SNV
3 3 SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
Desired output (tab-delimited)
ID Name Func List
2 2 SNV
3 3 SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
AWK
awk -F'\t' -v OFS='\t' 'NR>3 # define FS and OFS as tab and look in 3 row of file
$3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL"{ # start block and look in row 3 in`$2` for any of these words
sub(/:GMAF=*/,"",$4); # if found then search `$4` for `:GMAF=`
VAL=substr($4,RSTART+4,RLENGTH-4); 3 extract the 4 digits after the = sign as VAL
} # close block
for(i=1;i<=num;i++){ # create a loop to iterate over each line as i
if(VAL[i] <= 0.01){ 3 check each VAL in iand if less then or equal to 0.01
{ # start block
print $1, $2, $3, VAL; # print output
} # end block
next # process next line
} # end block
1' file
Edit by Ed Morton, only to make the above code easier to read:
awk -F'\t' -v OFS='\t' ' # define FS and OFS as tab
NR>3 # and look in 3 row of file
$3 == "SNV" || $3 == "MNV" || $3 == "INDEL" { # start block and look in row 3 in`$2` for any of these words
sub(/:GMAF=*/,"",$4); # if found then search `$4` for `:GMAF=`
VAL=substr($4,RSTART+4,RLENGTH-4); 3 extract the 4 digits after the = sign as VAL
} # close block
for(i=1;i<=num;i++) { # create a loop to iterate over each line as i
if(VAL[i] <= 0.01) { 3 check each VAL in iand if less then or equal to 0.01
{ # start block
print $1, $2, $3, VAL; # print output
} # end block
next # process next line
} # end block
1' file
Answer 0 (score: 2)
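Short answer:
To catch the case where $4 is unset/blank/missing, check the field count: for such a line awk sees only 3 fields in total (NF==3).
To remove the # from the front of the header line, you can use any substitution function (e.g. sub). I used gensub, which is a GNU awk extension, in my test.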
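Both points can be sketched on their own, assuming the tab-delimited layout from the question with the header on line 3. The function name show_header_and_empty is hypothetical and used only for illustration. With an explicit tab separator, a line that ends in a trailing tab has NF==4 and an empty $4, so this sketch tests both conditions rather than NF==3 alone.

```shell
# Hypothetical helper: print the header without its leading "#",
# then every data row whose fourth field is absent or empty.
show_header_and_empty() {
  awk -F'\t' '
    NR == 3 { sub(/^#/, ""); print; next }    # header line: drop the leading "#"
    NR > 3 && (NF < 4 || $4 == "") { print }  # $4 missing or empty
  ' "$1"
}
```

Call it as `show_header_and_empty file`. It does not filter on $3; it only shows how the two checks behave.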
Full answer:
The code below seems to fit your needs. I did not test it on a tab-delimited file, but you can adjust it to your own file accordingly.
$ cat file4
##.....
##.....
#ID Name Func List
1 1 REF
2 2 SNV
3 3 SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
4 4 RNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
5 5 SNV AMAF=0.0041:EMAF=0.0:GMAF=0.14
6 6 INDEL
7 7 RNV
8 8 SNV GMAF=0.0041:EMAF=0.0:AMAF=0.0014
9 9 SNV EMAF=0.0041:GMAF=0.1:AMAF=0.0014
$ awk 'NR<3{next}NR==3{print gensub(/^#/,"","1");next}($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") && NF==3{print;next}
($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") {val=gensub(/.*GMAF=(.[^:]*).*/,"\\1","g",$4);if (val+0<=0.01) print}' file4
ID Name Func List
2 2 SNV
3 3 SNV AMAF=0.0041:EMAF=0.0:GMAF=0.0014
6 6 INDEL
8 8 SNV GMAF=0.0041:EMAF=0.0:AMAF=0.0014
Explanation:
awk 'NR<3{next} # skip the first two lines
NR==3{print gensub(/^#/,"","1");next} # print the third line (header) with the leading # removed
($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") && NF==3{print;next} # print the lines missing $4 and go to the next line
($3 == "SNV"|| $3 == "MNV"|| $3 == "INDEL") { # if $3 fulfills the criteria then
val=gensub(/.*GMAF=(.[^:]*).*/,"\\1","g",$4); # isolate the value of GMAF with a regex
if (val+0<=0.01) print; # val+0 forces a numeric comparison against the 0.01 threshold; print on success
}' file4
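The command above relies on gensub, which is specific to GNU awk, and was run on a whitespace-separated file. As a hedged alternative, here is a sketch in POSIX awk for the tab-delimited input from the question. It uses match() with RSTART and RLENGTH, the approach the question was reaching for. The function name filter_gmaf and the variable max are illustrative, and the sketch assumes the header is always on line 3 and that a missing GMAF counts as zero.

```shell
# Hypothetical helper: keep the header plus SNV/MNV/INDEL rows
# whose GMAF is missing or at most 0.01.
filter_gmaf() {
  awk -F'\t' -v max=0.01 '
    NR < 3  { next }                        # skip the "##" metadata lines
    NR == 3 { sub(/^#/, ""); print; next }  # header: strip the leading "#"
    $3 != "SNV" && $3 != "MNV" && $3 != "INDEL" { next }
    {
      gmaf = 0                              # assumption: no GMAF means zero
      if (match($4, /GMAF=[^:]*/))          # locate "GMAF=<value>" in $4
        gmaf = substr($4, RSTART + 5, RLENGTH - 5) + 0  # text after "GMAF=", as a number
      if (gmaf <= max) print
    }
  ' "$1"
}
```

Call it as `filter_gmaf file`. Adding 0 to the extracted text makes awk compare numbers rather than strings, which matters for values such as 0.1 versus 0.01.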