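Experts,
I have a question. I have a large data file with many columns and rows. The first part is tab-separated, and the second part is separated by ";". I want to extract the first five columns and, from the ";"-separated part, the EUR_AF= and AF= fields, and write them to a new file.
Example of the file (3 lines):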
13 19020013 rs181615907 C T 100 PASS AA=.;AC=83;AF=0.12;AFR_AF=0.05;AMR_AF=0.15;AN=758;ASN_AF=0.17;AVGPOST=0.8701;ERATE=0.0007;EUR_AF=0.11;LDAF=0.1423;RSQ=0.6009;SNPSOURCE=LOWCOV;THETA=0.0051;VT=SNP
13 19020047 rs186129910 A . 100 PASS AA=.;AC=0;AF=0.0005;AFR_AF=0.0020;AN=758;AVGPOST=0.9992;ERATE=0.0005;LDAF=0.0008;RSQ=0.4992;SNPSOURCE=LOWCOV;THETA=0.0112;VT=SNP
13 19020095 rs140871821 C T 100 PASS AA=.;AC=38;AF=0.05;AFR_AF=0.08;AMR_AF=0.05;AN=758;ASN_AF=0.03;AVGPOST=0.9904;ERATE=0.0005;EUR_AF=0.05;LDAF=0.0538;RSQ=0.9245;SNPSOURCE=LOWCOV;THETA=0.0069;VT=SNP
I tried:
awk -F'[\t;]' ' NR > 30 {
    for (i = 1; i <= NF; i++) {
        if ($i ~ /EUR_AF/) {
            printf $1 " " $2 " " $3 " " $4 " " $5 " " $10 " " "%s ", $i
        }
    }
    print ""
}' head50.txt
Output:
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71
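Problem: lines in which the EUR_AF field is not present (such as the second one) are now missing. I would like to see those lines too, with just the AF= value, like this: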
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020047 rs186129910 A . AF=0.0005
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71
I hope someone can help me.
Thanks in advance.
鲁
Answer 0 (score: 0)
Here is a simple way to get what you want:
awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,"EUR_AF="b[2]+0}' file
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020047 rs186129910 A . AF=0.0005 EUR_AF=0
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
It prints EUR_AF=0 for line 2, because that field is not present there.
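The 0 comes from how awk converts values to numbers. When the separator does not occur, split leaves b[2] unset, and an unset value plus 0 is 0. When it does occur, b[2] holds the rest of the string (for example "0.11;LDAF=0.1423"), and awk uses only its leading numeric part. A minimal sketch, using made-up one-column input rather than the real file:

```shell
# With EUR_AF present: b[2] is "0.11;LDAF=0.1423", which converts to 0.11
with=$(echo 'AA=.;AF=0.12;EUR_AF=0.11;LDAF=0.1423' |
  awk '{split($1, b, ";EUR_AF="); print b[2] + 0}')

# With EUR_AF absent: b[2] is unset, and unset + 0 evaluates to 0
without=$(echo 'AA=.;AF=0.0005;LDAF=0.0008' |
  awk '{split($1, b, ";EUR_AF="); print b[2] + 0}')

echo "$with $without"   # prints: 0.11 0
```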
If you do not want that, you can test for it:
awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,(b[2]?"EUR_AF="b[2]+0:"")}' file
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11
13 19020047 rs186129910 A . AF=0.0005
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
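Splitting on ";AF=" assumes that AF is never the first key in column 8, and the +0 conversion reformats the number, so 0.70 would be printed as 0.7. As an alternative sketch (not part of the answer above), the column can be split on ";" and each KEY=VALUE item checked by its prefix, which keeps the original text and leaves no trailing space. The function name extract_af is hypothetical, and the input lines are shortened versions of the sample data:

```shell
# Hypothetical helper: reads whitespace-separated lines on stdin and prints
# columns 1-5 plus the AF= and EUR_AF= items from column 8, when present.
extract_af() {
  awk '{
    af = ""; eur = ""
    n = split($8, kv, ";")            # column 8 into KEY=VALUE items
    for (i = 1; i <= n; i++) {
      if (kv[i] ~ /^AF=/)     af  = kv[i]
      if (kv[i] ~ /^EUR_AF=/) eur = kv[i]
    }
    line = $1 " " $2 " " $3 " " $4 " " $5
    if (af  != "") line = line " " af
    if (eur != "") line = line " " eur
    print line
  }'
}

printf '%s\n' \
  '13 19020013 rs181615907 C T 100 PASS AA=.;AC=83;AF=0.12;AFR_AF=0.05;EUR_AF=0.11;VT=SNP' \
  '13 19020047 rs186129910 A . 100 PASS AA=.;AC=0;AF=0.0005;AFR_AF=0.0020;VT=SNP' |
  extract_af
```

For the real data this would be run as extract_af < file > newfile. The anchored patterns matter: /^AF=/ does not match AFR_AF= or EUR_AF=, so only the plain AF item is picked up.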