awk to display the differences between 2 files

Date: 2016-03-02 20:59:51

Tags: awk

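I am trying to use awk to show the count and the differences (if any) between two files. The awk below displays the count of unique $3 values in file2, but how can I display the IDs that were not found? Thank you :).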

file1

ACTA2
ACTC1
APC
APOB
BRCA1
BRCA2

file2 (ACTA2, ACTC1, and APC are each unique, so they are used in the count)

chr10:90694965-90695138 ACTA2-1269|gc=52.6 639.7
chr10:90697803-90698014 ACTA2-1270|gc=50.2 347.6
chr15:35082598-35082771 ACTC1-254|gc=50.3 603.8
chr15:35085431-35085785 ACTC1-258|gc=54.8 633.8
chr15:35086866-35087046 ACTC1-259|gc=67.2 291.0
chr5:112043405-112043589 APC-1396|gc=70.1 334.8
chr5:112090578-112090732 APC-1397|gc=39.6 171.6
chr5:112102006-112102125 APC-1398|gc=33.6 52.3
chr5:112102876-112103097 APC-1399|gc=41.2 177.4

AWK

awk -F '[- ]' '!seen[$3]++ {n++} END {print n " ids found"}' file2

Desired output (the count comes from file2 - that part already works)

3 ids found and APOB,BRCA1,BRCA2 missing

2 answers:

Answer 0: (score: 1)

This gets you very close to the output you want:

$ awk -F'[ -]' 'NR == FNR { seen[$0]; next } !seen[$3]++ { n++ }
END { print n " ids found"; for (i in seen) if (!seen[i]) print i " missing" }' file1 file2
3 ids found
APOB missing
BRCA1 missing
BRCA2 missing

It basically loops over the seen array and checks the values. !seen[i] is true if the ID was never seen in the second file.
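To make the mechanism concrete, here is a reduced sketch (not the answer's exact command) that prints the value stored for each key. It assumes a cut-down file1 with only two IDs and two file2 lines from the question, written to a temporary directory; the `counts` variable is just an illustrative name. Merely referencing `seen[$0]` creates the key with an empty value, which is why untouched IDs still show up in the END loop.

```shell
# Reduced sample data (two IDs, two file2 lines) in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' ACTA2 APOB > "$tmp/file1"
printf '%s\n' 'chr10:90694965-90695138 ACTA2-1269|gc=52.6 639.7' \
              'chr10:90697803-90698014 ACTA2-1270|gc=50.2 347.6' > "$tmp/file2"

# First file (NR == FNR): referencing seen[$0] creates the key with an empty value.
# Second file: seen[$3]++ increments the value on every line carrying that ID.
# The output is piped through sort because for-in order is unspecified in awk.
counts=$(awk -F'[ -]' 'NR == FNR { seen[$0]; next } { seen[$3]++ }
    END { for (i in seen) print i, seen[i] + 0 }' "$tmp/file1" "$tmp/file2" | sort)
printf '%s\n' "$counts"
rm -rf "$tmp"
```

ACTA2 ends with a value of 2 and APOB stays at 0, so `!seen[i]` is true only for APOB.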

Answer 1: (score: 1)

Here is a prototype:

$ awk -F '[- ]' 'NR==FNR{a[$0];next} 
               ($3 in a){delete a[$3]} 
                    END {for(k in a) printf "%s ",k; print "missing"}' file{1,2}

BRCA1 BRCA2 APOB missing

With the right output format:

$ awk -F '[- ]' 'NR==FNR{a[$0];next} 
               ($3 in a){delete a[$3]; c++} 
                     END{printf "%s ids found and ", c; 
                         for(k in a) {printf "%s",sep k; sep=","} 
                         print " missing"}' file{1,2}

3 ids found and BRCA1,BRCA2,APOB missing
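The missing IDs above come out as BRCA1,BRCA2,APOB because awk does not specify the order in which `for (k in a)` visits keys. If the file1 order matters, one possible variant is sketched below. It is only a sketch under a few assumptions: the sample data from the question is recreated in a temporary directory, and the names `order`, `want`, `found`, and `result` are illustrative, not taken from either answer.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ACTA2
ACTC1
APC
APOB
BRCA1
BRCA2
EOF
cat > "$tmp/file2" <<'EOF'
chr10:90694965-90695138 ACTA2-1269|gc=52.6 639.7
chr10:90697803-90698014 ACTA2-1270|gc=50.2 347.6
chr15:35082598-35082771 ACTC1-254|gc=50.3 603.8
chr15:35085431-35085785 ACTC1-258|gc=54.8 633.8
chr15:35086866-35087046 ACTC1-259|gc=67.2 291.0
chr5:112043405-112043589 APC-1396|gc=70.1 334.8
chr5:112090578-112090732 APC-1397|gc=39.6 171.6
chr5:112102006-112102125 APC-1398|gc=33.6 52.3
chr5:112102876-112103097 APC-1399|gc=41.2 177.4
EOF

result=$(awk -F'[- ]' '
    # file1: remember each ID and the position it appeared in
    NR == FNR { order[++m] = $0; want[$0]; next }
    # file2: count each wanted ID only the first time it shows up
    ($3 in want) && !found[$3]++ { c++ }
    END {
        printf "%d ids found and ", c
        # walk the IDs in file1 order and list those never found in file2
        for (j = 1; j <= m; j++)
            if (!(order[j] in found)) { printf "%s%s", sep, order[j]; sep = "," }
        print " missing"
    }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data this prints `3 ids found and APOB,BRCA1,BRCA2 missing`, matching the order requested in the question.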