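I want to check for duplicates at the level of each field: if the value is duplicated, populate "Yes", otherwise "No". Then populate that field's counter with an incrementing occurrence count. Finally, check whether the entire line is a duplicate or unique.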
Input.csv
Name,Age,Sub
abc,10,eee
def,20,csc
abc,30,mec
ghi,40,sss
abc,10,eee
def,10,csc
Expected output:
Name,Age,Sub,Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter
abc,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
def,20,csc,Yes,1,No,1,Yes,1,No,1
abc,30,mec,Yes,2,No,1,No,1,No,1
ghi,40,sss,No,1,No,1,No,1,No,1
abc,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
def,10,csc,Yes,2,Yes,3,Yes,2,No,1
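I searched for similar cases and found the uniq -c command and the awk idiom !seen[$1]++, but these seem to produce only the unique values/lines. Please suggest a solution.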
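Edit #1:
Ed Morton, sorry for the poorly written post; I have edited it, please take a look. In our real-world scenario we receive quotes from our vendors containing rate and cost information for A-Z destinations, broken down by country, region, product, and product code. We cannot tell up front which duplicate lines should be removed; once the columns above are populated, we can review them and decide quickly.
For example, I am trying to check whether field $1 contains any duplicates. In the Name field, "abc" occurs three times, "def" occurs twice, and "ghi" occurs once. A value that occurs only once is treated as "Name_Dup = No" with "Name_Counter = 1" (e.g. ghi). "abc" occurs 3 times, so it is a duplicate: its first occurrence gets "Name_Dup = Yes" and "Name_Counter = 1", its second gets "Name_Dup = Yes" and "Name_Counter = 2", and its third gets "Name_Dup = Yes" and "Name_Counter = 3".
The same check is then needed for $2, $3, and so on up to $NF, and for $0.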
Answer 0 (score: 2)
awk solution:
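awk '
# Return "Yes" if value f was counted more than once in the first pass, else "No"
function hasDupe(arr, f){
    return (arr[f]>1)? "Yes":"No"
}
BEGIN{ FS=OFS="," }
NR==1{ next }    # first pass: skip the header line
# First pass: count total occurrences of each field value and of each whole line
NR==FNR{ names[$1]++; ages[$2]++; subs[$3]++; all[$0]++; next }
# Second pass: print the extended header, then each line with its Dup flags and running counters
{
    if (FNR==1)
        print $0,"Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter";
    else
        print $0,hasDupe(names,$1),++n[$1],hasDupe(ages,$2),++a[$2],hasDupe(subs,$3),++s[$3],hasDupe(all,$0),++all_lines[$0]
}' Input.csv Input.csv    # the same file is given twice so that awk reads it in two passes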
Output:
Name,Age,Sub,Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter
abc,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
def,20,csc,Yes,1,No,1,Yes,1,No,1
abc,30,mec,Yes,2,No,1,No,1,No,1
ghi,40,sss,No,1,No,1,No,1,No,1
abc,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
def,10,csc,Yes,2,Yes,3,Yes,2,No,1
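The question also asks for the same check on every field up to $NF and on $0, whereas the solution above names the three columns explicitly. A minimal sketch of how the same two-pass idea might be generalised to any number of columns is shown below. The array names total and seen, the helper yesno, and the output file Output.csv are hypothetical names chosen for illustration; the sketch assumes a header line whose names can be reused for the new columns, and simple comma-separated data without quoted fields.

```shell
# Recreate the sample input from the question.
cat > Input.csv <<'EOF'
Name,Age,Sub
abc,10,eee
def,20,csc
abc,30,mec
ghi,40,sss
abc,10,eee
def,10,csc
EOF

awk '
# Hypothetical helper: "Yes" if the total count is greater than 1, else "No".
function yesno(n) { return (n > 1) ? "Yes" : "No" }

BEGIN { FS = OFS = "," }

# Pass 1: count every value per column index, and every whole line (index 0).
NR == FNR {
    if (FNR > 1) {                       # skip the header line
        for (i = 1; i <= NF; i++) total[i, $i]++
        total[0, $0]++
    }
    next
}

# Pass 2, header: derive the new column names from the existing ones.
FNR == 1 {
    out = $0
    for (i = 1; i <= NF; i++) out = out OFS $i "_Dup" OFS $i "_Counter"
    print out, "EntireLine_Dup", "EntireLine_Counter"
    next
}

# Pass 2, data: append the Dup flag and the running counter for each field,
# then the same pair for the whole line.
{
    out = $0
    for (i = 1; i <= NF; i++) {
        n = ++seen[i, $i]                # running occurrence number of this value
        out = out OFS yesno(total[i, $i]) OFS n
    }
    n = ++seen[0, $0]
    print out, yesno(total[0, $0]), n
}' Input.csv Input.csv > Output.csv

cat Output.csv
```

Keying the arrays on the column index together with the value keeps equal values in different columns from being counted together, and index 0 is used for the whole line by analogy with $0. On the sample input this should reproduce the expected output shown in the question.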