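I'd like to ask a follow-up to a question I posted earlier:
awk compare columns from two files, impute values of another column
When several values have no match, I'd like to know how to print NA for each of them.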
File1
rs1 AA 10
rs2 BB 20
rs3 CC 30
rs4 DD 40
File2
rs1 QQ TT UU
rs3 RR WW
rs4 ZZ
Desired output
rs1 AA 10 QQ TT UU
rs2 BB 20 NA NA NA
rs3 CC 30 RR WW NA
rs4 DD 40 ZZ NA NA
This code prints NA only when the whole line ($0) is missing:
awk 'FNR==NR{a[$1]=$0;next}{print $0,a[$1]?a[$1]:"NA"}' file2 file1
Current output:
rs1 AA 10 QQ TT UU
rs2 BB 20 NA
rs3 CC 30 RR WW
rs4 DD 40 ZZ
Answer 0 (score: 1)
Something like this?
awk 'FNR==NR{for (i=2;i<=NF;i++) a[i,$1]=$i;next}{printf "%s\t",$0; for (i=2;i<=6;i++) printf "%s\t",(a[i,$1]?a[i,$1]:"NA");print ""}' f2 f1
rs1 AA 10 QQ TT UU NA NA
rs2 BB 20 NA NA NA NA NA
rs3 CC 30 RR WW NA NA NA
rs4 DD 40 ZZ NA NA NA NA
Since you have a large file, you need to set the loop's upper bound (6 here) to the number of columns you want.
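Instead of hard-coding the bound, the loop limit could be taken from the widest row seen in file2. The following is only a sketch under some assumptions: the scratch files f1 and f2 hold the sample data from the question, the variable names max and key are made up for illustration, the separator is a space to match the desired output, and the truthiness test is swapped for an `in` membership test so that a stored value such as 0 is not replaced by NA.

```shell
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'rs1 AA 10' 'rs2 BB 20' 'rs3 CC 30' 'rs4 DD 40' > "$tmp/f1"
printf '%s\n' 'rs1 QQ TT UU' 'rs3 RR WW' 'rs4 ZZ' > "$tmp/f2"

result=$(awk '
    FNR == NR {                        # first file (f2): store each field by column and key
        for (i = 2; i <= NF; i++) a[i SUBSEP $1] = $i
        if (NF > max) max = NF         # remember the widest row (hypothetical variable name)
        next
    }
    {                                  # second file (f1): print the line, then the padded columns
        printf "%s", $0
        for (i = 2; i <= max; i++) {
            key = i SUBSEP $1
            printf " %s", (key in a ? a[key] : "NA")
        }
        print ""
    }
' "$tmp/f2" "$tmp/f1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data, the widest row in f2 has four fields, so every output line gets exactly three extra columns.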
Answer 1 (score: 1)
Try this:
awk '
BEGIN {OFS = "\t"}
FNR == NR {                        # first file (file2): remember the extra columns per key
    if (NF > 1) {
        if (NF > maxnf) maxnf = NF # widest row seen in file2
        nf[$1] = NF
        a[$1] = $2
        for (i = 3; i <= NF; ++i) a[$1] = a[$1] "\t" $i
    }
    next
}
{
    if (NF < 3) {$3 = $2; $2 = " "}
    else $1 = $1                   # ensure fields are separated by tabs
    printf("%s", $0)               # use %s so a literal % in the data is not read as a format
    n = 1
    if ($1 in a) {n = nf[$1]; printf("\t%s", a[$1])}
    for (i = n; i < maxnf; ++i) printf("\tNA")   # pad up to the widest file2 row
    print ""
}
' file2 file1
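This assumes that file1 has a fixed number of columns. In the output, columns are separated by tabs.
For space-separated output, pipe the output through expand -t 6, or use whatever tab stops you prefer. With -t 6, it looks like this: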
rs1   AA    10    QQ    TT    UU
rs2   BB    20    NA    NA    NA
rs3   CC    30    RR    WW    NA
rs4   DD    40    ZZ    NA    NA
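To see how the tab-stop choice changes the result, here is a small sketch that runs one tab-separated line (written by hand to mimic what the script would emit for rs3) through expand with two different widths. The variable names narrow and wide are just for illustration.

```shell
# One tab-separated line, as the awk script would emit it for rs3.
line=$(printf 'rs3\tCC\t30\tRR\tWW\tNA')

# expand replaces each tab with spaces up to the next multiple of the -t value.
narrow=$(printf '%s\n' "$line" | expand -t 6)
wide=$(printf '%s\n' "$line" | expand -t 8)

printf '%s\n' "$narrow" "$wide"
```

The tab stop should be larger than the longest field; otherwise a long value runs past its stop and the columns no longer line up across rows.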