Filtering on multiple columns at once and extracting rows

Asked: 2017-08-24 11:13:08

Tags: awk filter

I have a file like this:

[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17361 T * ./. 0/0 0/0
1 17365 C G ./. 0/0 0/0
1 17373 A G 0/0 ./. 0/0
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17379 G A 0/0 ./. 0/0
1 17385 G A 0/0 ./. 0/0
1 17398 C A ./. ./. ./.
1 17403 A G 0/0 ./. ./.
1 17406 C T 0/0 ./. ./.
1 17407 G A 0/0 ./. ./.
1 17408 C G 0/0 ./. ./.
1 17452 C T 0/0 0/0 0/0
1 17478 C T 0/0 0/0 0/0
1 17479 G A 0/0 0/0 0/0
1 17483 C T 0/0 0/0 0/0
1 17484 G A 0/1 1/1 1/1
15 52640990 TAA TAAA,TAAAA,TA,T,TAAAAA 1/3 1/1 0/1
15 72252189 TAAA TAAAA,TAA,T,TAAAAA,TA,TAAAAAA 0/0 0/1 1/2

I want to extract all rows that have different combinations of values in $5, $6 and $7. For example: $5 = 0/1, $6 = 0/1, $7 = 0/1; $5 = 0/1, $6 = 0/1, $7 = 1/1; $5 = 1/1, $6 = 0/1, $7 = 1/1; and $5 = 0/1, $6 = 1/1, $7 = 1/1.

Expected output:

    [1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
    1 13380 C G 0/1 0/1 0/1
    1 17375 A G 0/1 0/1 1/1
    1 17378 C T 1/1 0/1 1/1
    1 17484 G A 0/1 1/1 1/1

I tried a single filter like this, but it produced no output.

awk -F '\t' '{ if(($5 = 0/1) && ($6 =0/1) && ($7 = 0/1)) { print }}' file1 > file2out

I'm not sure whether awk can do this. Thanks for your help!
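The attempt prints nothing because, in awk, `=` is assignment and the unquoted `0/1` is arithmetic division: `$5 = 0/1` stores 0 in field 5 and evaluates as false, so the condition never holds. A minimal sketch of a corrected filter follows. It assumes whitespace-separated columns (add `-F '\t'` back if the real file is strictly tab-delimited), and the file name `sample.txt` and array name `want` are only illustrative. It compares the three genotype columns as strings against the four combinations listed in the question.

```shell
# Build a small sample with the same layout as the question's file (hypothetical file name).
cat > sample.txt <<'EOF'
[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17361 T * ./. 0/0 0/0
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17484 G A 0/1 1/1 1/1
EOF

# Keep the header, then keep rows whose genotype triple is one of the wanted combinations.
out=$(awk '
BEGIN {
    # Wanted combinations from the question; each key is "GT_MA GT_PA GT_HI".
    # Referencing an element is enough to create it.
    want["0/1 0/1 0/1"]; want["0/1 0/1 1/1"]
    want["1/1 0/1 1/1"]; want["0/1 1/1 1/1"]
}
NR == 1 { print; next }            # header line
(($5 " " $6 " " $7) in want)       # string lookup, not assignment or division
' sample.txt)
printf '%s\n' "$out"
```

Listing the combinations explicitly only works when they are known in advance; the answers below instead derive them from the data.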

2 answers:

Answer 0 (score: 1)

This awk one-liner may help:

 awk '{s=$5 FS $6 FS $7}s!~"[.]/[.]" && s~/[1-9]/ && !a[s]++' file

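Output (for the chromosome 1 rows; the two chromosome 15 rows in the question's sample also pass these tests and would be printed as well):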

[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17484 G A 0/1 1/1 1/1
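The one-liner applies three tests to the string built from columns 5 to 7: no genotype is missing (`./.`), at least one digit from 1 to 9 appears, and the combination has not been seen before. The header passes as well, because its column labels contain the digits 5, 6 and 7. The sketch below spells the same logic out with named variables; `sample.txt`, `key`, `missing`, `variant` and `seen` are illustrative names, and the row at position 17376 is an invented duplicate added only to show the deduplication.

```shell
# Small sample in the question's layout; the 17376 row is an invented duplicate.
cat > sample.txt <<'EOF'
[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17361 T * ./. 0/0 0/0
1 17375 A G 0/1 0/1 1/1
1 17376 A G 0/1 0/1 1/1
1 17484 G A 0/1 1/1 1/1
EOF

out=$(awk '
{
    key     = $5 FS $6 FS $7        # genotype triple joined by the field separator
    missing = (key ~ /[.]\/[.]/)    # 1 if any genotype is ./.
    variant = (key ~ /[1-9]/)       # 1 if any digit 1-9 appears (also true for the header)
}
# Print only the first row of each complete combination that contains such a digit.
!missing && variant && !seen[key]++
' sample.txt)
printf '%s\n' "$out"
```

Because `seen[key]++` is evaluated last, rows rejected by the first two tests are never counted as seen.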

Answer 1 (score: 0)

Could you please try the following and let me know whether it helps you.

 awk 'NR==1{print;next} !a[$5,$6,$7]++ && $0 !~ /\.\/\./'   Input_file

EDIT: You could also try this one.

awk 'NR==1{print;next} !a[$5,$6,$7]++ && $0 !~ /\.\/\./ && ($0 !~ /[2-9]\// || $0 !~ /\/[2-9]/)'  Input_file

EDIT1: Let's say we have the following Input_file.

cat Input_file
[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17361 T * ./. 0/0 0/0
1 17365 C G ./. 0/0 0/0
1 17373 A G 0/0 ./. 0/0
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17379 G A 0/0 ./. 0/0
1 17385 G A 0/0 ./. 0/0
1 17398 C A ./. ./. ./.
1 17403 A G 0/0 ./. ./.
1 17406 C T 0/0 ./. ./.
1 17407 G A 0/0 ./. ./.
1 17408 C G 0/0 ./. ./.
1 17452 C T 0/0 0/0 0/0
1 17478 C T 0/0 0/0 0/0
1 17479 G A 0/0 0/0 0/0
1 17483 C T 0/0 0/0 0/0
1 17484 G A 0/1 1/1 1/1
1 17408 C G 0/0 ./. ./.
1 17452 C T 0/0 0/0 0/0
1 17478 C T 0/0 0/0 0/0
1 17479 G A 0/0 0/0 0/0
1 17483 C T 2/0 0/3 0/1
1 17484 G A 2/3 1/2 1/3

When I run the code from EDIT on it, I get the following result.

awk 'NR==1{print;next} !a[$5,$6,$7]++ && $0 !~ /\.\/\./ && ($0 !~ /[2-9]\// || $0 !~ /\/[2-9]/)' Input_file
[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17484 G A 0/1 1/1 1/1
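Note that this result still contains the all-reference row (`0/0 0/0 0/0`), which the question's expected output leaves out, and that the tests run against the whole line (`$0`) rather than the genotype columns. As a possible variant (not the answerer's code), the sketch below restricts every test to columns 5 to 7 and also drops all-reference rows. It assumes single-digit allele indices and bi-allelic calls of interest; `gt_sample.txt`, `gt` and `seen` are illustrative names.

```shell
# Sample in the question's layout, including the two multi-allelic chromosome 15 rows.
cat > gt_sample.txt <<'EOF'
[1]CHROM [2]POS [3]REF [4]ALT [5]GT_MA [6]GT_PA [7]GT_HI
1 13380 C G 0/1 0/1 0/1
1 13504 G A 0/0 0/0 0/0
1 17361 T * ./. 0/0 0/0
1 17375 A G 0/1 0/1 1/1
1 17378 C T 1/1 0/1 1/1
1 17484 G A 0/1 1/1 1/1
15 52640990 TAA TAAA,TAAAA,TA,T,TAAAAA 1/3 1/1 0/1
15 72252189 TAAA TAAAA,TAA,T,TAAAAA,TA,TAAAAAA 0/0 0/1 1/2
EOF

out=$(awk '
NR == 1 { print; next }          # always keep the header
{ gt = $5 " " $6 " " $7 }        # look only at the genotype columns
gt ~ /\./    { next }            # skip rows with a missing call (./.)
gt ~ /[2-9]/ { next }            # skip rows that use a third or later allele
gt !~ /1/    { next }            # skip all-reference rows (0/0 0/0 0/0)
!seen[$5, $6, $7]++              # first occurrence of each remaining combination
' gt_sample.txt)
printf '%s\n' "$out"
```

On this sample the sketch prints the header and the four rows from the question's expected output.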