Input 1
1 10611 2 122 C:0.983607 G:0.0163934
Input 2
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07
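Here, columns 1 and 2 must match between the two files. In addition, the value before the ':' in column 5 of the first file must equal column 4 of the second file, and the value before the ':' in column 6 of the first file must equal column 5 of the second file. The output is built from the lines that match on all of these. The input and output lines should make the idea clear. Both files are .gz files.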
Output
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934
Answer 0 (score: 1)
Here is one way using awk:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2
Results:
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
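To make the logic of the one-liner easier to follow, here is the same awk program spread over several lines with comments. This is only a sketch: the wrapper function name `merge_freqs` is hypothetical and not part of the original answer, and it assumes both files are whitespace-delimited and that the first file is not empty (otherwise `FNR==NR` would also be true for the second file).

```shell
# merge_freqs FREQ_FILE VCF_FILE
# Hypothetical wrapper around the awk program from the answer above.
merge_freqs() {
  awk '
    # First file only: FNR (per-file line number) equals NR (global line number).
    FNR == NR {
      split($5, a, ":")      # a[1] = reference allele, a[2] = its frequency
      split($6, b, ":")      # b[1] = alternate allele, b[2] = its frequency
      # Key on chromosome, position, REF allele and ALT allele.
      c[$1, $2, a[1], b[1]] = "REF=" a[2] ";ALT=" b[2] ";"
      next                   # do not fall through to the second-file rule
    }
    # Second file: print only lines whose chrom, pos, REF and ALT were stored.
    ($1, $2, $4, $5) in c {
      print $0 ";" c[$1, $2, $4, $5]
    }
  ' "$1" "$2"
}
```

It would be called as `merge_freqs input1 input2`. Lines of the second file without a matching key are dropped silently, which is also what the one-liner does.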
So, for the compressed files, try (the `<(...)` process substitution requires a shell such as bash):
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz
Edit:
Based on the comments below, try this:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2
Results:
1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;
Answer 1 (score: 0)
This should work (assuming you have enough disk space to store the uncompressed .gz files):
zcat 1 | awk '{print $1$2,$0}' | sort > new1
zcat 2 | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -11 -21 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7"|sed 's/ C:/;REF=/'|sed 's/ G:/;ALT=/' > output
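Note that the pipeline above is tied to the sample line: the two `sed` calls only handle a `C` reference and a `G` alternate allele, the alleles themselves are never compared, and the key `$1$2` has no separator (so chromosome 1, position 10611 would collide with chromosome 11, position 0611). A possible variant of the same sort/join idea without those limits is sketched below. The function name `join_freqs` and the `:` key separator are assumptions for illustration, and the sketch assumes the second file has exactly 8 whitespace-separated columns, as in the sample.

```shell
# join_freqs FREQ_GZ VCF_GZ
# Hypothetical helper: sort/join approach with an allele check.
join_freqs() {
  work=$(mktemp -d)
  # Prefix every line with a "chrom:pos" key, then sort on that key.
  gzip -dc "$1" | awk '{ print $1 ":" $2, $0 }' | LC_ALL=C sort -k1,1 > "$work/new1"
  gzip -dc "$2" | awk '{ print $1 ":" $2, $0 }' | LC_ALL=C sort -k1,1 > "$work/new2"
  # Output: the 8 original columns of file 2, then the two allele:freq columns of file 1.
  LC_ALL=C join -1 1 -2 1 -o '2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7' \
      "$work/new1" "$work/new2" |
    awk '{
      split($9, r, ":")    # r[1] = allele, r[2] = frequency (file 1, column 5)
      split($10, t, ":")   # same for file 1, column 6
      # Keep the line only if the alleles agree with REF ($4) and ALT ($5).
      if (r[1] == $4 && t[1] == $5)
        print $1, $2, $3, $4, $5, $6, $7, $8 ";REF=" r[2] ";ALT=" t[2]
    }'
  rm -rf "$work"
}
```

It could be used as `join_freqs 1.gz 2.gz | gzip > output.gz`. Sorting and joining under `LC_ALL=C` keeps the collation order consistent between `sort` and `join`.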