我有以下CSV文件:
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en
我想根据第一列的值删除重复项。如果有多个相同值的记录,我只想在新文件中保留一个:
我开始使用以下实际找到重复的内容,但我想创建一个新文件,而不仅仅是打印:
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
答案 0 :(得分:6)
Nut 100%确定你喜欢什么,但只有在等于时才会得到最后一个输入:
awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
如果字段1
是唯一的,那么保留哪条记录并不重要
如果有几个相等的话,这将显示第一个命中:
awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
要将复制到新文件中:
awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
答案 1 :(得分:0)
这将仅显示给定第一列值的第一个条目:
awk -F, '!(seen[$1]++)' file > newfile