grep if the value of a word is greater than a value

Time: 2019-01-21 22:22:10

Tags: shell awk sed grep bioinformatics

I have a file like this:

1   51710   .   C   A   .   clustered_events;contamination;germline_risk;read_position;t_lod   DP=1;ECNT=6;POP_AF=1.000e-03;P_GERMLINE=-1.372e-02;TLOD=4.20   GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB  0/1:0,1:1.000:1:0,0:0,1:26:0,136:43:2:0|1:51637_C_T:0.990,0.00,1.00:0.025,0.028,0.947
19  27733067    .   A   G,C .   clustered_events;contamination;germline_risk;multiallelic   DP=60;ECNT=15;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.169e-04,-2.325e-04;TLOD=11.46,7.14   GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:5,35,20:0.500,0.333:6:0,2,1:1,1,1:34,35:112,143,117:42,45:29,47:0.444,0.485,0.500:0.037,0.019,0.944
20  42199704    .   GGT G,GGTGGGTGGGTGTGTGT .   germline_risk   DP=100;ECNT=2;POP_AF=0.112,0.024;P_GERMLINE=-2.964e-04,-8.826e-06;TLOD=3.76,9.83    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:1,2,7:0.168,0.301:20:1,1,4:9,1,1:34,35:147,203,146:60,60:51,62:0.192,0.253,0.263:0.038,0.014,0.948

I want to grep lines in two steps:

First, lines that have DP > 45. Then, lines where the first value after the : in the last column is > 2.

So, in the first line, DP = 1 and the first value after the : in the last column is 0.

In the second line, DP = 60 and the first value after the : in the last column is 5.

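From the example input file above, after the first step we should get:

19  27733067    .   A   G,C .   clustered_events;contamination;germline_risk;multiallelic   DP=60;ECNT=15;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.169e-04,-2.325e-04;TLOD=11.46,7.14   GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:5,35,20:0.500,0.333:6:0,2,1:1,1,1:34,35:112,143,117:42,45:29,47:0.444,0.485,0.500:0.037,0.019,0.944
20  42199704    .   GGT G,GGTGGGTGGGTGTGTGT .   germline_risk   DP=100;ECNT=2;POP_AF=0.112,0.024;P_GERMLINE=-2.964e-04,-8.826e-06;TLOD=3.76,9.83    GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:1,2,7:0.168,0.301:20:1,1,4:9,1,1:34,35:147,203,146:60,60:51,62:0.192,0.253,0.263:0.038,0.014,0.948

After the second step we should get:

19  27733067    .   A   G,C .   clustered_events;contamination;germline_risk;multiallelic   DP=60;ECNT=15;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.169e-04,-2.325e-04;TLOD=11.46,7.14   GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:5,35,20:0.500,0.333:6:0,2,1:1,1,1:34,35:112,143,117:42,45:29,47:0.444,0.485,0.500:0.037,0.019,0.944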

Any help?

5 answers:

Answer 0: (score: 3)

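grep is the wrong tool for comparing numbers to see whether they are greater or less than a value.

Here's a perl one-liner that prints the lines meeting both conditions: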

perl -ane 'print if $F[7] =~ /DP=(\d+)/ && $1 > 45 && $F[9] =~ /:(\d+)/ && $1 > 2' input.txt

Answer 1: (score: 2)

If you insist on using grep, you can get DP > 45 with

grep 'DP=\(4[6-9]\|[5-9][0-9]\|[1-9][0-9]\{2,\}\)[^0-9]'
#            |         |            |
#          46-49       |          100..∞
#                    50-99
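
The \| alternation used above is a GNU extension to basic regular expressions. Below is a sketch of the same pattern in extended syntax (grep -E), run against a few made-up INFO strings to check the boundaries. The variable name dp_gt_45 and the test strings are illustrative only.

```shell
# ERE form of the pattern: the alternatives cover 46-49, 50-99 and 100 or more.
dp_gt_45='DP=(4[6-9]|[5-9][0-9]|[1-9][0-9]{2,})[^0-9]'

# Hypothetical INFO strings, one per line, chosen around the threshold.
matched=$(printf '%s\n' \
  'DP=1;ECNT=6' 'DP=45;ECNT=1' 'DP=46;ECNT=1' 'DP=60;ECNT=15' 'DP=100;ECNT=2' \
  | grep -E "$dp_gt_45")

# Only the strings with DP of 46, 60 and 100 are kept.
printf '%s\n' "$matched"
```

Because of the trailing [^0-9], either form needs at least one character after the number, so a DP value at the very end of a line would not match.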

Answer 2: (score: 1)

Could you please try the following.

awk '
{
  split($8,array,"[;=]")
  if(array[1]=="DP" && array[2]>45){
    split($10,array1,"[:,]")
    if(array1[2]>2){
       print
    }
  }
}'  Input_file

Explanation: here is the same code with a comment on each line.

awk '                                    ##Start the awk program.
{                                        ##Start the main block, which runs for every input line.
  split($8,array,"[;=]")                 ##Split the 8th field on ; and = into array with the built-in split().
  if(array[1]=="DP" && array[2]>45){     ##If the 1st element is DP and the 2nd element is greater than 45:
    split($10,array1,"[:,]")             ##Split the 10th field on : and , into array1.
    if(array1[2]>2){                     ##If the 2nd element of array1 is greater than 2:
       print                             ##Print the current line.
    }                                    ##Close the inner if block.
  }                                      ##Close the outer if block.
}' Input_file                            ##Name of the input file.
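
The code above expects DP to be the first key in the 8th field. If that cannot be relied on, one possible variation is to walk all the ;-separated entries and pick out DP wherever it sits. This is only a sketch: the function name filter_vcf, the file reordered.vcf, and its records (the question's data, shortened and with DP moved to second place) are assumptions for illustration.

```shell
# Hypothetical input: shortened records from the question, with DP deliberately
# placed after ECNT in the 8th field.
cat > reordered.vcf <<'EOF'
1 51710 . C A . clustered_events ECNT=6;DP=1 GT:AD:AF 0/1:0,1:1.000
19 27733067 . A G,C . clustered_events ECNT=15;DP=60 GT:AD:AF 0/1/2:5,35,20:0.500,0.333
20 42199704 . GGT G,GGTGGGTGGGTGTGTGT . germline_risk ECNT=2;DP=100 GT:AD:AF 0/1/2:1,2,7:0.168,0.301
EOF

# filter_vcf FILE: print records with DP > 45 and a first value after ":" > 2.
filter_vcf() {
  awk '
  {
    dp = -1                              # stays -1 when no DP key is present
    n = split($8, info, ";")             # 8th field: KEY=VALUE entries separated by ;
    for (i = 1; i <= n; i++)
      if (info[i] ~ /^DP=/)
        dp = substr(info[i], 4) + 0      # text after "DP=", forced to a number
    split($10, sample, "[:,]")           # sample[2] is the first value after ":"
    if (dp > 45 && sample[2] + 0 > 2)
      print
  }' "$1"
}

filter_vcf reordered.vcf
```

On this sample the call prints only the record at 19 27733067.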

Answer 3: (score: 1)

With GNU awk for the third argument to match():

$ awk 'match($0,/ DP=([^;]+).* [^:]+:([^,]+)/,a) && (a[1] > 45) && (a[2] > 2)' file
19  27733067    .   A   G,C .   clustered_events;contamination;germline_risk;multiallelic   DP=60;ECNT=15;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.169e-04,-2.325e-04;TLOD=11.46,7.14   GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/1/2:5,35,20:0.500,0.333:6:0,2,1:1,1,1:34,35:112,143,117:42,45:29,47:0.444,0.485,0.500:0.037,0.019,0.944

Answer 4: (score: 1)

Use the right tool for the job. See the "bcftools view" options for more information; something like this should work:

bcftools view -i 'INFO/DP > 45 & FORMAT/AD[0:0] > 2' myFile.vcf

More options from the bcftools manual:

INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
FORMAT/AD[0:1]               .. first sample, second AD value
FORMAT/AD[1:0]               .. second sample, first AD value
DP4[*] == 0                  .. any DP4 value
FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
FORMAT/DP[1-3] > 10          .. samples 2-4
FORMAT/DP[1-]  < 7           .. all samples but the first
FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
FORMAT/AD[0:1]               .. first sample, second AD field
FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
CSQ[*] ~ "missense_variant.*deleterious"