I have a CSV file in the following format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I would like to group the rows by the unique IDs in the first column and concatenate the types into a single line, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found that awk handles this kind of scenario well, but all I have managed to produce is:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also fix the formatting of the types in the second column?
Answer 0 (score: 2)
Quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only the first time a given line is seen, so repeated lines are skipped.
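To see the idiom in isolation, here is a minimal sketch on a hypothetical three-line input (not the question's file) in which the first and last lines are identical. A pattern with no action prints the line, so only first occurrences survive, in their original order.

```shell
# Hypothetical input: the first and third lines are exact duplicates.
deduped=$(printf '%s\n' '"id-3"|"A"' '"id-1"|"B"' '"id-3"|"A"' |
  awk '!seen[$0]++')   # seen[$0] is 0 on first sight, so !0 is true; then it is incremented

printf '%s\n' "$deduped"
```

This should print the two distinct lines, with "id-3"|"A" first because it appeared first.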
If the whole second column should be enclosed in a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
Answer 1 (score: 2)
With GNU awk for true multi-dimensional arrays, gensub(), and sorted_in:
$ awk -F'|' '
    { a[$1][gensub(/"/,"","g",$2)] }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (i in a) {
            c = 0
            for (j in a[i]) {
                printf "%s%s", (c++ ? ":" : i "|\""), j
            }
            print "\""
        }
    }
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
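Both the output rows and the values within each row are sorted in ascending string order (i.e. alphabetically).
Answer 2 (score: 1)
Short GNU datamash + tr solution: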
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
Output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the double quotes between the items should be removed, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
Output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
Answer 3 (score: 0)
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
Output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
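One caveat: sort -u orders the values within each id, but the order in which for (i in a) visits the ids is unspecified in awk, so the rows themselves may come out in any order. A minimal sketch of one way to pin the row order is to pipe the result through sort. The sample data is fed through a pipe here instead of process substitution, and an explicit if/else replaces the ternary; both are choices made for this sketch rather than part of the answer above.

```shell
# Sample data from the question, supplied on stdin instead of from a file.
grouped=$(printf '%s\n' '"id-1"|"A"' '"id-2"|"C"' '"id-1"|"B"' '"id-1"|"D"' \
                        '"id-2"|"B"' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' |
  sort -u |
  awk -F'|' '{
         gsub(/"/, "", $2)                 # drop the quotes around the type
         if ($1 in a) a[$1] = a[$1] ":" $2 # append to an existing group
         else         a[$1] = $2           # start a new group
       }
       END { for (i in a) printf "%s|\"%s\"\n", i, a[i] }' |
  sort)                                    # fix the otherwise unspecified row order

printf '%s\n' "$grouped"
```

With the question's data this should yield the three rows for id-1, id-2 and id-3, in that order.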
Answer 4 (score: 0)
For the sample input, the one-liners below will work, but the output is not sorted.
One-liners
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
For better readability:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
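Note that the regexp version treats $2 as a pattern, so a value containing regex metacharacters can match a different value that is already stored. A minimal sketch of a literal-matching variant uses index() instead; the input below is hypothetical and is chosen so that "a.c" as a regexp would match the existing "abc" and be dropped.

```shell
# Hypothetical input: "a.c" contains a regex metacharacter.
merged=$(printf '%s\n' 'k|abc' 'k|a.c' 'k|abc' |
  awk 'BEGIN { FS = OFS = "|" }
       {
         if (!($1 in a))
           a[$1] = $2
         # Wrap both sides in ":" so only whole values match, compared literally.
         else if (index(":" a[$1] ":", ":" $2 ":") == 0)
           a[$1] = a[$1] ":" $2
       }
       END { for (i in a) print i, a[i] }')

printf '%s\n' "$merged"
```

Here both "abc" and "a.c" are kept, and the repeated "abc" is skipped.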
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
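Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data has additional columns and you want to deduplicate on specific ones only, you may prefer !seen[$1,$2]++, which uses column 1 and column 2 as the index.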
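The difference only shows up once there are more than two columns. As a minimal sketch, assume a hypothetical third column (x, y, z below) that differs between rows with the same id and type; counting the surviving rows shows how the two keys behave.

```shell
# Hypothetical three-column data: the first two rows share id and type.
rows=$(printf '%s\n' 'id-1|A|x' 'id-1|A|y' 'id-1|B|z')

# Key on the whole line: all three rows are distinct.
by_line=$(printf '%s\n' "$rows" | awk -F'|' '!seen[$0]++ { n++ } END { print n }')

# Key on columns 1 and 2 only: the second row counts as a duplicate.
by_cols=$(printf '%s\n' "$rows" | awk -F'|' '!seen[$1,$2]++ { n++ } END { print n }')

printf 'whole line: %s rows kept, columns 1-2: %s rows kept\n' "$by_line" "$by_cols"
```

Keying on $0 keeps all three rows, while keying on $1,$2 keeps two.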