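Extracting multiple columns using different delimiters and positions

Time: 2014-03-13 10:27:29

Tags: bash awk multiple-columns

Experts,

I have a question. I have a large data file with many columns and rows. The leading columns are tab-separated, and the last column is a list of fields separated by ";". I want to extract the first five columns, plus the EUR_AF= and AF= fields from the ";"-separated part, and write them to a new file.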

Example of the file (3 lines):

13  19020013    rs181615907 C   T   100 PASS    AA=.;AC=83;AF=0.12;AFR_AF=0.05;AMR_AF=0.15;AN=758;ASN_AF=0.17;AVGPOST=0.8701;ERATE=0.0007;EUR_AF=0.11;LDAF=0.1423;RSQ=0.6009;SNPSOURCE=LOWCOV;THETA=0.0051;VT=SNP   
13  19020047    rs186129910 A   .   100 PASS    AA=.;AC=0;AF=0.0005;AFR_AF=0.0020;AN=758;AVGPOST=0.9992;ERATE=0.0005;LDAF=0.0008;RSQ=0.4992;SNPSOURCE=LOWCOV;THETA=0.0112;VT=SNP
13  19020095    rs140871821 C   T   100 PASS    AA=.;AC=38;AF=0.05;AFR_AF=0.08;AMR_AF=0.05;AN=758;ASN_AF=0.03;AVGPOST=0.9904;ERATE=0.0005;EUR_AF=0.05;LDAF=0.0538;RSQ=0.9245;SNPSOURCE=LOWCOV;THETA=0.0069;VT=SNP

I tried:

awk -F'[\t;]' ' NR > 30 {
    for (i = 1; i <= NF; i++) {
        if ($i ~ /EUR_AF/) {
        printf $1 " " $2 " " $3 " " $4 " " $5 " " $10 " " "%s ", $i
        }
    }
    print ""
}' head50.txt

Output:

13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 

13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05 
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73 
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09 
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71 

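Problem: rows where the EUR_AF field is absent (for example the second one) are now missing. I would like to see those rows as well, with just the AF= value, like this: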

13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 
13 19020047 rs186129910 A . AF=0.0005
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05 
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73 
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09 
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71 

I hope someone can help me.

Thanks in advance.

1 answer:

Answer 0 (score: 0)

Here is a smart way to get what you want:

awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,"EUR_AF="b[2]+0}' file
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020047 rs186129910 A . AF=0.0005 EUR_AF=0
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05

It prints EUR_AF=0 for line 2, because that field is not present on that line.
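
A side note on why the +0 works: when awk uses a string in arithmetic, it converts the leading numeric prefix and ignores the rest, so a[2], which still carries the trailing ";AFR_AF=..." text, becomes a plain number. An array element that split never set converts to 0, which is where EUR_AF=0 comes from. A minimal illustration, using a made-up one-field input rather than the real file:

```shell
# The input string stands in for a[2]; in arithmetic awk keeps only the numeric prefix.
printf '0.12;AFR_AF=0.05\n' | awk '{ print $1 + 0 }'   # prints 0.12
```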

If you do not want it printed at all, you can test for it:

awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,(b[2]?"EUR_AF="b[2]+0:"")}' file
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020047 rs186129910 A . AF=0.0005
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
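
If it helps, here is an alternative sketch that looks the fields up by name instead of splitting on a literal separator. It assumes, as in the sample, that the ";"-separated part is the eighth whitespace-separated column and contains no spaces. The function name extract_af and the skipping of lines that start with "#" are my own additions for illustration, not something from the question.

```shell
# Hypothetical helper: prints columns 1-5 plus AF= and, when present, EUR_AF=.
extract_af() {
    awk '
    /^#/ { next }                      # assumption: header lines start with "#"
    {
        af = ""; eur = ""
        n = split($8, info, ";")       # break the eighth column into KEY=VALUE items
        for (i = 1; i <= n; i++) {
            if (info[i] ~ /^AF=/)     af  = info[i]   # anchored, so AFR_AF= does not match
            if (info[i] ~ /^EUR_AF=/) eur = info[i]
        }
        line = $1 " " $2 " " $3 " " $4 " " $5
        if (af  != "") line = line " " af
        if (eur != "") line = line " " eur
        print line
    }' "$@"
}
```

To write the result to a new file you could run something like extract_af head50.txt > af_columns.txt, where af_columns.txt is just a placeholder name. Because the values are copied as text, AF=0.70 stays AF=0.70, whereas the +0 conversion above would print it as AF=0.7; this version also leaves no trailing space on rows without EUR_AF.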