awk: check for duplicates and add incremental counts

Time: 2017-09-22 15:42:18

Tags: unix awk

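I want to check for duplicates at the level of each field and fill in "Yes" or "No" depending on whether the value is duplicated, then fill in a running (incremental) count for that field. After that, I want to check whether the entire line is a duplicate or unique.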

Input.csv

Name,Age,Sub
abc,10,eee
def,20,csc
abc,30,mec
ghi,40,sss
abc,10,eee
def,10,csc

Expected output:

Name,Age,Sub,Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter
abc,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
def,20,csc,Yes,1,No,1,Yes,1,No,1
abc,30,mec,Yes,2,No,1,No,1,No,1
ghi,40,sss,No,1,No,1,No,1,No,1
abc,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
def,10,csc,Yes,2,Yes,3,Yes,2,No,1

I searched for similar cases and looked at the uniq -c command, and !seen[$1]++ seems to produce only the unique values/rows. Please advise.
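For reference, here is a minimal sketch of why !seen[$1]++ on its own does not fit this task: the expression is true only the first time a key appears, so it filters the input down to first occurrences instead of annotating every row. The function name first_occurrences is hypothetical and used only for illustration; the sample rows are the ones from Input.csv above.

```shell
# Hypothetical helper: keep only the first row seen for each value of field 1.
first_occurrences() {
    awk -F, '!seen[$1]++'
}

# Sample rows from the question, fed on standard input.
printf '%s\n' 'Name,Age,Sub' 'abc,10,eee' 'def,20,csc' 'abc,30,mec' \
    'ghi,40,sss' 'abc,10,eee' 'def,10,csc' | first_occurrences
```

This prints the header plus one row each for abc, def and ghi; the repeated rows are dropped, so no per-row Dup flag or counter can be derived from it directly.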

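Edit #1:

Ed Morton, sorry for the poor post; I have edited it. Kindly take a look. In our real-world scenario, we receive quotations from our vendors containing rate and cost information broken down by country, region, product, and product code for A-Z destinations. We therefore cannot tell in advance which duplicate rows need to be removed; once the columns above are populated, we can review them and decide quickly.

For example, I am trying to check whether field $1 contains any duplicated information. Under the Name field, "abc" occurs three times, "def" occurs twice, and "ghi" occurs once. So if a value is not repeated (it occurs only once), it is treated as Name_Dup = No and its occurrence count is Name_Counter = 1 (i.e. ghi).

Whereas "abc" occurs 3 times, so it is a duplicate: on its first occurrence, Name_Dup = Yes and Name_Counter = 1; on its second occurrence, Name_Dup = Yes and Name_Counter = 2; on its third occurrence, Name_Dup = Yes and Name_Counter = 3.

The same check is then needed for $2, $3, ... up to $NF, and for $0.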

1 Answer:

Answer 0 (score: 2)

awk solution:

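awk 'function hasDupe(arr, f){
         # "Yes" if the value occurred more than once in the whole file
         return (arr[f]>1)? "Yes":"No"
     }
     BEGIN{ FS=OFS="," }
     # pass 1 (first copy of the file): skip the header, then count every value and every whole line
     NR==1{ next }
     NR==FNR{ names[$1]++; ages[$2]++; subs[$3]++; all[$0]++; next }
     # pass 2 (second copy): extend the header, then print each row with its flags and running counts
     {
         if (FNR==1)
             print $0,"Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter";
         else
             print $0,hasDupe(names,$1),++n[$1],hasDupe(ages,$2),++a[$2],hasDupe(subs,$3),++s[$3],hasDupe(all,$0),++all_lines[$0]
     }' file file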

Output:

Name,Age,Sub,Name_Dup,Name_Counter,Age_Dup,Age_Counter,Sub_Dup,Sub_Counter,EntireLine_Dup,EntireLine_Counter
abc,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
def,20,csc,Yes,1,No,1,Yes,1,No,1
abc,30,mec,Yes,2,No,1,No,1,No,1
ghi,40,sss,No,1,No,1,No,1,No,1
abc,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
def,10,csc,Yes,2,Yes,3,Yes,2,No,1
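
The question also asks for the same check on $2, $3, ... up to $NF. The following is only a sketch of one way to generalize the two-pass idea to any number of columns, not part of the original answer. It assumes a comma-separated file with a header row and no quoted fields containing commas; the wrapper name annotate_dupes and the array names total and seen are hypothetical. Index 0 is used as the key for the whole line.

```shell
# Hypothetical wrapper: annotate every column (and the whole line) of a CSV file.
annotate_dupes() {
    awk '
    BEGIN { FS = OFS = "," }
    # Pass 1: count each (column, value) pair and each whole line, skipping the header.
    NR == FNR {
        if (FNR > 1) {
            for (i = 1; i <= NF; i++) total[i, $i]++
            total[0, $0]++
        }
        next
    }
    # Pass 2, header row: append <col>_Dup and <col>_Counter for each column.
    FNR == 1 {
        line = $0
        for (i = 1; i <= NF; i++) line = line OFS $i "_Dup" OFS $i "_Counter"
        print line, "EntireLine_Dup", "EntireLine_Counter"
        next
    }
    # Pass 2, data rows: append the Yes/No flag and the running count per column.
    {
        line = $0
        for (i = 1; i <= NF; i++) {
            flag = (total[i, $i] > 1) ? "Yes" : "No"
            n = ++seen[i, $i]
            line = line OFS flag OFS n
        }
        flag = (total[0, $0] > 1) ? "Yes" : "No"
        n = ++seen[0, $0]
        print line, flag, n
    }' "$1" "$1"
}
```

Called as annotate_dupes Input.csv, it reads the file twice, like the answer above, and takes the new column names from the header row instead of hard-coding them.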