Question

I need some help fixing my code to process a tab-delimited dataset. The sample data is:

#ID type
A   3
A   Ct
A   Ct
A   chloroplast
B   Ct
B   Ct
B   chloroplast
B   chloroplast
B   4
C   Ct
C   Ct
C   chloroplast

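For each unique element in column 1, I want to count the elements in column 2 that match the pattern "Ct" and those that do not. So the desired output is: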

#ID  count_for_matches count_for_unmatched
A   2   2
B   2   3
C   2   1

I can get the counts for the pattern match with this:

awk '$2~/Ct/{x++};$2!~/Ct/{y++}END{print x,y}'

I know I can handle each item separately by using column 1 as the index of an array, like this:

awk '{a[$1]++}END{for (i in a) print i}'

But I can't combine the two parts into working code. I tried combinations like this:

awk '{a[$1]++}END{for (i in a){$2~/Ct/{x++};$2!~/Ctt/{y++}}END{print i,x,y}}}'

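But I'm obviously making some mistakes, and I can't figure out how to fix this from the answers on the forum. Maybe the $2 values should be stored in a[$1]? I'd be grateful if someone could point out the error!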

Answer 1

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 { next }
!seen[$1]++ { keys[++numKeys] = $1 }
$2=="Ct" { matches[$1]++; next }
{ unmatched[$1]++ }
END {
    print "#ID", "count_for_matches", "count_for_unmatched"
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        print key, matches[key]+0, unmatched[key]+0
    }
}

$ awk -f tst.awk file
#ID     count_for_matches       count_for_unmatched
A       2       2
B       2       3
C       2       1
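
For comparison, the two one-liners from the question can also be merged directly by indexing both counters by $1. This is only a minimal sketch, not part of the answer above: it keeps the question's regex test $2 ~ /Ct/ (which would also count a value such as "Ctx" as a match, unlike the exact comparison $2=="Ct"), and the variable names data, result, ids, x and y are chosen here for illustration.

```shell
# Recreate the sample data from the question as tab-separated text.
data=$(printf '#ID\ttype\nA\t3\nA\tCt\nA\tCt\nA\tchloroplast\nB\tCt\nB\tCt\nB\tchloroplast\nB\tchloroplast\nB\t4\nC\tCt\nC\tCt\nC\tchloroplast\n')

# Both counters are arrays keyed by the ID in column 1,
# so every ID gets its own pair of tallies.
result=$(printf '%s\n' "$data" | awk -F'\t' '
    NR == 1   { next }              # skip the header line
              { ids[$1] }           # remember every ID that occurs
    $2 ~ /Ct/ { x[$1]++; next }     # regex match, as in the question
              { y[$1]++ }           # everything that did not match
    END { for (i in ids) print i, x[i]+0, y[i]+0 }
' | sort)                           # for-in order is unspecified, so sort

printf '%s\n' "$result"
```

The attempt in the question fails because patterns such as $2~/Ct/ are only valid as top-level rules, not inside an END block, and by the time END runs the per-line values of $2 are gone. Counting while the lines are read avoids both problems.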

Answer 2

Here is another, minimalist version:

$ awk 'NR==1{print $1,"count_for_matches","count_for_unmatches";next}
    $2=="Ct"{m[$1]++} 
            {a[$1]++} 
         END{for(k in a) print k, m[k], a[k]-m[k]}' file | 
 column -t

#ID  count_for_matches  count_for_unmatches
A    2                  2
B    2                  3
C    2                  1
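
Two details of the one-liner above are worth keeping in mind: for (k in a) visits the keys in an unspecified order, and m[k] prints as an empty field for an ID that never has the value Ct. The following sketch shows one possible way to handle both, assuming a shortened input with a hypothetical extra ID "D" that is not in the original data.

```shell
# Shortened sample plus a hypothetical ID "D" that never has the value Ct.
data=$(printf '#ID\ttype\nA\tCt\nA\t3\nD\tchloroplast\n')

result=$(printf '%s\n' "$data" | awk -F'\t' '
    NR == 1    { next }          # skip the header line
    $2 == "Ct" { m[$1]++ }       # exact matches per ID
               { a[$1]++ }       # all rows per ID
    # Adding 0 forces a numeric 0 where m[k] was never set.
    END { for (k in a) print k, m[k]+0, a[k]-m[k] }
' | sort)                        # sort, since for-in order is not guaranteed

printf '%s\n' "$result"
```

Without the +0, the line for D would contain an empty second field, which column -t would then collapse into two columns.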

Counting pattern-matching elements in an awk array