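My output file has 800,000 rows and 8 fields, covering 3 samples; I show only 2 rows here. From each row I want to extract only specific pieces of information: chr, position, SNP-ID, quality, DP, QD, and the genotype (./., 0/0, 0/1 or 1/1). I need a script that extracts this information and writes it to a new file. Any advice is appreciated. Thanks.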
#chr pos SNP-ID Qual Info geno(sample1) geno(sample2) geno(sample3)
chrM 152 rs117135796 7427.14 AC=2;AF=0.333;AN=6;BaseQRankSum=-20.485;DB;DP=702;DS;Dels=0.00;FS=167.659;HaplotypeScore=2.6106;MLEAC=2;MLEAF=0.333;MQ=50.00;MQ0=0;MQRankSum=-1.507;QD=36.77;ReadPosRankSum=12.041 0/0:250,0:237:99:0,701,10320 0/0:250,0:238:99:0,713,10507 1/1:0,202:192:99:7465,572,0
chr10 5874 rs118203891 33.13 AC=1;AF=0.167;AN=6;BaseQRankSum=1.454;DB;DP=657;DS;Dels=0.00;FS=124.424;HaplotypeScore=5.1214;MLEAC=1;MLEAF=0.167;MQ=45.31;MQ0=0;MQRankSum=2.462;QD=0.15;ReadPosRankSum=-8.096 0/1:204,24:206:64:64,0,6345 0/0:203,0:193:99:0,473,6944 0/0:226,0:215:99:0,524,6448
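Answer 0 (score: 2)
Try:
gawk -f ext.awk data.txt > summary.txt
where data.txt is your input data file and ext.awk is the script that follows. It uses the three-argument form of match(), which is a GNU awk (gawk) extension: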
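# Requires gawk: the third argument of match() is a GNU extension.
NR > 1 {                                  # skip the header line
    # INFO is field 5; take the DP=... and QD=... entries wherever they occur
    match($5, /(^|;)(DP=[^;]+)/, a)
    DP = a[2]
    match($5, /(^|;)(QD=[^;]+)/, a)
    QD = a[2]
    # the genotype is the part of each sample field before the first ":"
    match($6, /^([^:]+\/[^:]+)(:|$)/, a)
    gt1 = a[1]
    match($7, /^([^:]+\/[^:]+)(:|$)/, a)
    gt2 = a[1]
    match($8, /^([^:]+\/[^:]+)(:|$)/, a)
    gt3 = a[1]
    print $1, $2, $3, $4, DP, QD, gt1, gt2, gt3
}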
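The script above depends on gawk's three-argument match(). Where only a POSIX awk (mawk, for example) is available, the same extraction can be sketched with split() instead. This is a minimal sketch under stated assumptions, not the answerer's method: the shell function name extract_fields and the awk helper info() are hypothetical names introduced here, and the demo record uses a shortened INFO column taken from the question's first row.

```shell
# Hypothetical wrapper: extract chr, pos, SNP-ID, Qual, DP, QD and genotypes.
extract_fields() {
    awk '
    # Return the "KEY=value" entry for key from a ";"-separated INFO string,
    # or "NA" when the key is absent.
    function info(str, key,    n, parts, i) {
        n = split(str, parts, ";")
        for (i = 1; i <= n; i++)
            if (index(parts[i], key "=") == 1)
                return parts[i]
        return "NA"
    }
    NR > 1 {                                  # skip the header line
        printf "%s %s %s %s %s %s", $1, $2, $3, $4, info($5, "DP"), info($5, "QD")
        # The genotype is the first ":"-separated item of each sample field.
        for (i = 6; i <= NF; i++) {
            split($i, g, ":")
            printf " %s", g[1]
        }
        printf "\n"
    }' "$@"
}

# Demo on a shortened copy of the first record from the question.
printf '%s\n' \
    '#chr pos SNP-ID Qual Info geno(sample1) geno(sample2) geno(sample3)' \
    'chrM 152 rs117135796 7427.14 AC=2;DP=702;MQ=50.00;QD=36.77 0/0:250,0:237 0/0:250,0:238 1/1:0,202:192' |
    extract_fields
# prints: chrM 152 rs117135796 7427.14 DP=702 QD=36.77 0/0 0/0 1/1
```

On the real file this would be called as extract_fields data.txt > summary.txt. Matching whole KEY= prefixes keeps MQ from being confused with MQ0, and the lookup does not care where in the INFO column a key appears.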
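Assuming the genotype is given by the first 3 characters of each sample field (from $6 to $NF), you can try the following, which handles any number of sample columns (note that it reports MQ rather than QD):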
NR > 1 {                                  # skip the header line
    match($5, /(^|;)(DP=[^;]+)/, a)
    DP = a[2]
    match($5, /(^|;)(MQ=[^;]+)/, a)
    MQ = a[2]
    printf "%s %s %s %s %s %s ", $1, $2, $3, $4, DP, MQ
    # print the first 3 characters of every sample field
    for (i = 6; i <= NF; i++) {
        printf "%s", substr($i, 1, 3)
        if (i < NF) printf " "
        else printf "\n"
    }
}
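If you also want to skip records with DP below 10 or MQ below 50, and to recode the genotypes (./. as NA, 0/0 as 0, 0/1 as 1, 1/1 as 2), then you can try: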
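BEGIN {
    # lookup table: genotype string -> numeric code
    geno["./."] = "NA"
    geno["0/0"] = "0"
    geno["0/1"] = "1"
    geno["1/1"] = "2"
}
NR > 1 {                                  # skip the header line
    match($5, /(^|;)(DP=[^;]+)/, a)
    DP = a[2]
    match(DP, /=(.*)$/, a)                # numeric part of DP=...
    dpv = a[1]
    match($5, /(^|;)(MQ=[^;]+)/, a)
    MQ = a[2]
    match(MQ, /=(.*)$/, a)                # numeric part of MQ=...
    mqv = a[1]
    # "+ 0" forces a numeric comparison; drop low-depth or low-quality records
    if (dpv + 0 < 10 || mqv + 0 < 50) next
    printf "%s %s %s %s %s %s ", $1, $2, $3, $4, DP, MQ
    for (i = 6; i <= NF; i++) {
        type = substr($i, 1, 3)
        printf "%s", geno[type]
        if (i < NF) printf " "
        else printf "\n"
    }
}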
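The thresholds in the script above are hard-coded. As a hedged sketch of one way to make them adjustable, the cut-offs can be passed in with awk's -v option, again using only POSIX awk features. The names filter_recode, info_value, min_dp and min_mq are hypothetical and introduced here for illustration. This sketch also assumes that a genotype missing from the lookup table should be printed as NA, and a record with no DP or MQ entry is treated as 0 and therefore dropped.

```shell
# Hypothetical wrapper. Usage: filter_recode MIN_DP MIN_MQ [file...]
# The ( ... ) body runs in a subshell, so min_dp and min_mq stay local.
filter_recode() (
    min_dp=$1
    min_mq=$2
    shift 2
    awk -v min_dp="$min_dp" -v min_mq="$min_mq" '
    # Return the value of key in a ";"-separated INFO string, or "" if absent.
    function info_value(str, key,    n, parts, i) {
        n = split(str, parts, ";")
        for (i = 1; i <= n; i++)
            if (index(parts[i], key "=") == 1)
                return substr(parts[i], length(key) + 2)
        return ""
    }
    BEGIN {
        geno["./."] = "NA"; geno["0/0"] = "0"
        geno["0/1"] = "1";  geno["1/1"] = "2"
    }
    NR > 1 {                                  # skip the header line
        dp = info_value($5, "DP")
        mq = info_value($5, "MQ")
        # Force numeric comparison; a missing value counts as 0.
        if (dp + 0 < min_dp + 0 || mq + 0 < min_mq + 0) next
        printf "%s %s %s %s DP=%s MQ=%s", $1, $2, $3, $4, dp, mq
        for (i = 6; i <= NF; i++) {
            type = substr($i, 1, 3)
            code = (type in geno) ? geno[type] : "NA"   # assumption: unknown -> NA
            printf " %s", code
        }
        printf "\n"
    }' "$@"
)

# Demo on shortened copies of the two records from the question.
printf '%s\n' \
    '#chr pos SNP-ID Qual Info geno(sample1) geno(sample2) geno(sample3)' \
    'chrM 152 rs117135796 7427.14 AC=2;DP=702;MQ=50.00;QD=36.77 0/0:250,0:237 0/0:250,0:238 1/1:0,202:192' \
    'chr10 5874 rs118203891 33.13 AC=1;DP=657;MQ=45.31;QD=0.15 0/1:204,24:206 0/0:203,0:193 0/0:226,0:215' |
    filter_recode 10 50
# prints: chrM 152 rs117135796 7427.14 DP=702 MQ=50.00 0 0 2
```

The chr10 record is dropped because its MQ of 45.31 is below 50. On the real file the call would look like filter_recode 10 50 data.txt > summary.txt.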
Answer 1 (score: 1)
Perl makes for a nice, concise program:
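perl -lane '
    BEGIN { $, = " " }    # separate printed fields with a space; -l appends the newline
    @fields = @F[0..3];
    # DP and QD entries from the INFO column (assumes DP comes before QD)
    push @fields, $1, $2 if $F[4] =~ /\b(DP=[^;]+).*\b(QD=[^;]+)/;
    # the genotype is the first ":"-separated item of each sample field
    push @fields, (split /:/)[0] for @F[5,6,7];
    print @fields;
' <<END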
chrM 152 rs117135796 7427.14 AC=2;AF=0.333;AN=6;BaseQRankSum=-20.485;DB;DP=702;DS;Dels=0.00;FS=167.659;HaplotypeScore=2.6106;MLEAC=2;MLEAF=0.333;MQ=50.00;MQ0=0;MQRankSum=-1.507;QD=36.77;ReadPosRankSum=12.041 0/0:250,0:237:99:0,701,10320 0/0:250,0:238:99:0,713,10507 1/1:0,202:192:99:7465,572,0
chr10 5874 rs118203891 33.13 AC=1;AF=0.167;AN=6;BaseQRankSum=1.454;DB;DP=657;DS;Dels=0.00;FS=124.424;HaplotypeScore=5.1214;MLEAC=1;MLEAF=0.167;MQ=45.31;MQ0=0;MQRankSum=2.462;QD=0.15;ReadPosRankSum=-8.096 0/1:204,24:206:64:64,0,6345 0/0:203,0:193:99:0,473,6944 0/0:226,0:215:99:0,524,6448
END
Output:
chrM 152 rs117135796 7427.14 DP=702 QD=36.77 0/0 0/0 1/1
chr10 5874 rs118203891 33.13 DP=657 QD=0.15 0/1 0/0 0/0