I have input data such as:
chr17 41243232 41243373 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23
chr13 32954112 32954208 BRCA2_ex24
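I need to check for rows that are duplicated in columns $2 and $3. If rows are duplicated, I need to merge them into a single row, with the $4 column values printed as a comma-separated list.
Output: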
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
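Is there an AWK solution that handles this kind of data easily? I would appreciate an explanation of the solution. Input and output are tab-separated. Note: in the duplicated rows, the first, second and third fields are identical.
My attempt is: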
awk -v OFS="\t" '{i=$2 FS $1 FS $3 FS $4} {a[i]=!a[i]?$4:a[i] "," $4} END {for (l in a) {print l,a[l]}}' infile
Thanks for any ideas.
Answer 0 (score: 2)
$ cat script.awk
{
    a[$2 OFS $3] = $1                       # store $1, last instance
    b[$2 OFS $3] = b[$2 OFS $3] $4 ","      # append the $4s
}
END {
    for (i in a) {                          # order is awk default
        sub(/,$/, "", b[i])                 # remove trailing ","
        print a[i], i, b[i]                 # print
    }
}
Run it:
$ awk -f script.awk infile
chr17 41243471 41243644 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
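The for (i in a) loop above visits keys in an unspecified order, which is why the rows come out shuffled relative to the input. If input order matters, one option is to record each key the first time it is seen and walk a numeric index in END. The following is only a sketch under assumptions: tab-separated input, grouping on all three leading fields, and hypothetical names (merge_ordered, order, names) that are not part of the answer.

```shell
#!/bin/sh
# merge_ordered is a hypothetical helper name, not taken from the answer above.
# It reads tab-separated rows from the given files, or from stdin if none.
merge_ordered() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3              # group on the first three fields
        if (!(key in names)) {
            order[++n] = key                # remember first-seen position
            names[key] = $4
        } else {
            names[key] = names[key] "," $4  # append later $4 values
        }
    }
    END {
        for (i = 1; i <= n; i++)            # numeric loop keeps input order
            print order[i], names[order[i]]
    }' "$@"
}

# Sample rows from the question, written with real tabs.
sample=$(printf '%s\t%s\t%s\t%s\n' \
    chr17 41243232 41243373 BRCA1_ex11 \
    chr17 41243232 41243373 BRCA1_ex12 \
    chr17 41243471 41243644 BRCA1_ex11 \
    chr17 41243639 41243811 BRCA1_ex11 \
    chr13 32954112 32954208 BRCA2_ex23 \
    chr13 32954112 32954208 BRCA2_ex24)

merged=$(printf '%s\n' "$sample" | merge_ordered)
printf '%s\n' "$merged"
```

Because the key includes $1, rows on different chromosomes that happen to share coordinates are kept apart, which is a deliberate difference from keying on $2 and $3 alone.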
Answer 1 (score: 1)
Just replace the first assignment with i=$1 FS $2 FS $3, and perhaps filter the output through sed to replace the spaces with tabs:
... | sed 's/ / /g'
     space---^ ^--- TAB
Output:
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
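For reference, the suggested change can be written out as a complete command. The sketch below is one possible variant, not the answerer's exact command: it assumes tab-separated input and joins the key with OFS (a tab) rather than FS, so the sed step should not be needed. The function name merge_by_coords is hypothetical, and the sort is there only to make the unspecified for-in order reproducible.

```shell
#!/bin/sh
# merge_by_coords is a hypothetical helper name used for illustration.
merge_by_coords() {
    awk 'BEGIN { FS = OFS = "\t" }
    { i = $1 OFS $2 OFS $3 }                        # key no longer includes $4
    { a[i] = (a[i] == "" ? $4 : a[i] "," $4) }      # first value, or append
    END { for (l in a) print l, a[l] }' "$@"
}

# Sample rows from the question, written with real tabs.
rows=$(printf '%s\t%s\t%s\t%s\n' \
    chr17 41243232 41243373 BRCA1_ex11 \
    chr17 41243232 41243373 BRCA1_ex12 \
    chr17 41243471 41243644 BRCA1_ex11 \
    chr17 41243639 41243811 BRCA1_ex11 \
    chr13 32954112 32954208 BRCA2_ex23 \
    chr13 32954112 32954208 BRCA2_ex24)

# for-in order is unspecified, so sort for a stable listing.
result=$(printf '%s\n' "$rows" | merge_by_coords | LC_ALL=C sort)
printf '%s\n' "$result"
```

The original attempt never merged anything because $4 was part of the key, so every row formed its own group.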
Answer 2 (score: 1)
If perl is okay:
$ cat ip.txt
chr17 41243232 41243373 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23
chr13 32954112 32954208 BRCA2_ex24
$ perl -ale '$k = join "\t",@F[0..2]; $h{$k} .= $h{$k} ? ",$F[3]" : $F[3]; END{ print "$_\t$h{$_}" foreach (keys %h) }' ip.txt
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
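-ale : splits each input line on whitespace and saves the fields to the @F array, removes the newline from each input line, and adds a newline to print statements
$k = join "\t",@F[0..2] : uses the first three fields as the key, joined by tab
$h{$k} .= $h{$k} ? ",$F[3]" : $F[3] : appends the value to the hash entry, adding a leading , only when the existing value is not empty
END{ print "$_\t$h{$_}" foreach (keys %h) } : finally, prints each key and value separated by tab. The order of keys is random.
An alternative that uses a regex to extract the key and value: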
$ perl -nle '($k,$v)=/^(.*?)\s+(\S+)$/; $h{$k} .= $h{$k} ? ",$v" : $v; END{print "$_\t$h{$_}" foreach (keys %h) }' ip.txt
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
Answer 3 (score: 1)
$ cat tst.awk
{
    curr = $2 FS $3
    if (curr == prev) {
        buf = buf "," $NF
    }
    else {
        if (NR>1) {
            print buf
        }
        buf = $0
    }
    prev = curr
}
END { print buf }
$ awk -f tst.awk file
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
The differences between this and @JamesBrown's solution are: