Removing duplicates from a CSV using bash / awk

Date: 2017-10-12 13:45:59

Tags: bash csv awk duplicates

I have a CSV file in the following format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

I want to group the rows by the unique ID in the first column and concatenate the types onto a single line, like this:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

I found that awk handles this kind of scenario well, but all I have managed to produce is:

"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"

I used this command:

awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file

How can I remove the duplicates and fix the formatting of the types in the second column?

5 answers:

Answer 0 (score: 2)

Quick fix:

$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file 
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
  • !seen[$0]++ is true only the first time a given line is seen
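
The idiom can be tried on its own. The following is a minimal sketch rather than part of the answer; `dedupe_lines` is an illustrative name. `seen[$0]++` returns the count before incrementing, so it is 0 (false) for a line's first occurrence and non-zero afterwards; negating it makes the pattern true exactly once per distinct line.

```shell
# dedupe_lines: hypothetical helper that prints each distinct input line once,
# keeping first-seen order. With no action block, awk prints the line whenever
# the pattern !seen[$0]++ is true.
dedupe_lines() {
    awk '!seen[$0]++'
}

printf '%s\n' '"id-3"|"A"' '"id-1"|"B"' '"id-3"|"A"' | dedupe_lines
# prints:
# "id-3"|"A"
# "id-1"|"B"
```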


If the whole second column should sit inside a single pair of double quotes:

$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
                 !seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
                 END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"

Answer 1 (score: 2)

Using GNU awk for true multidimensional arrays, gensub(), and sorted_in:

$ awk -F'|' '
    { a[$1][gensub(/"/,"","g",$2)] }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (i in a) {
            c = 0
            for (j in a[i]) {
                printf "%s%s", (c++ ? ":" : i "|\""), j
            }
            print "\""
        }
    }
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Both the output rows and the values within each row will be sorted in ascending (i.e. alphabetical) order.
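
PROCINFO["sorted_in"] is specific to GNU awk. Where only a POSIX awk is available, a similar ordering can be obtained by sorting the input first. The following is a minimal sketch, assuming byte-wise (C locale) ordering is acceptable; `group_sorted` is an illustrative name and not part of the answer above.

```shell
# group_sorted FILE: hypothetical helper.
# sort -u drops duplicate lines and places lines with the same id next to
# each other, so awk can emit a group as soon as the id changes (no arrays).
group_sorted() {
    LC_ALL=C sort -u "$1" | awk -F'|' '
        { gsub(/"/, "", $2) }                  # strip quotes from the type
        NR == 1 || $1 != prev {                # first line of a new id
            if (NR > 1) print prev "|\"" acc "\""
            prev = $1; acc = $2
            next
        }
        { acc = acc ":" $2 }                   # same id: append the type
        END { if (NR > 0) print prev "|\"" acc "\"" }
    '
}

# Usage with the sample data from the question:
tmp=$(mktemp)
printf '%s\n' '"id-1"|"A"' '"id-2"|"C"' '"id-1"|"B"' '"id-1"|"D"' \
              '"id-2"|"B"' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' > "$tmp"
group_sorted "$tmp"
rm -f "$tmp"
# prints:
# "id-1"|"A:B:D"
# "id-2"|"B:C"
# "id-3"|"A"
```

Because the rows are printed while streaming through the sorted input, the row order does not depend on the traversal order of `for (i in a)`, which awk leaves unspecified.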

Answer 2 (score: 1)

A short GNU datamash + tr solution:

datamash -st'|' -g1 unique 2 <file | tr ',' ':'

Output:

"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"

----------

If the double quotes between the items should be removed, use the following alternative:

datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'

Output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Answer 3 (score: 0)

An awk + sort solution:

awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
           END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)

Output:

"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Answer 4 (score: 0)

For the sample input, the commands below will work, but the output is not sorted.

One-liners

# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile

# using a regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile

Test results:

$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"

$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"    

$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] :  a[$1]":"$2  ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"

For better readability:

Using a regexp

awk 'BEGIN{
           FS=OFS="|"
     }
     { 
           a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
     }
     END{
           for(i in a)
              print i,a[i]
     }
     ' infile
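
One caveat with the regexp variant: $2 is interpolated into the pattern, so any regex metacharacters in the data are interpreted rather than matched literally. For example, an incoming "a.c" would match a stored "abc" and be dropped. The following is a minimal sketch of a literal comparison using index(); `group_literal` is an illustrative name, not part of the answer.

```shell
# group_literal [FILE...]: hypothetical helper; reads stdin when no file is given.
# Wrapping both the stored list and the candidate in ":" lets index() test for
# a whole item with a plain substring search instead of a regex.
group_literal() {
    awk 'BEGIN { FS = OFS = "|" }
    {
        if (!($1 in a))
            a[$1] = $2                                   # first value for this id
        else if (index(":" a[$1] ":", ":" $2 ":") == 0)
            a[$1] = a[$1] ":" $2                         # not yet in the list
    }
    END { for (i in a) print i, a[i] }' "$@"
}

printf '%s\n' '"id-1"|"abc"' '"id-1"|"a.c"' '"id-1"|"abc"' | group_literal
# prints: "id-1"|"abc":"a.c"
```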

Using two arrays

awk 'BEGIN{
          FS=OFS="|"
     }
     !seen[$1,$2]++{ 
             a[$1] = ($1 in a ? a[$1] ":" : "") $2
     }
     END{
           for(i in a)
               print i,a[i]
     }' infile
  

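Note: you can also use !seen[$0]++, which uses the whole line as the index. With real data, where you may want to deduplicate on specific columns only, you may prefer !seen[$1,$2]++, which uses column 1 and column 2 as the index.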