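Identifying duplicate fields and assigning a random ID based on a condition

Date: 2018-04-24 10:37:00

Tags: unix awk sed grep

I'm badly stuck on some Unix stuff. Any guidance here would be much appreciated.

I want to identify duplicate records in the file below based on the id, assign each id a unique random number in a separate column, and sum up its value field. My input file: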

name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0

I want my output file to be:

RandonUniqID,ID,Value
2202,23A,130
3029,89A,20
3066,AV45,110
5077,2345,121
1055,AW3455,48
3099,3T45,57

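Here I want to generate a unique random ID for each record, and for the duplicated records I want their value fields summed into a single field. The trickiest part is that my 5th column, website, is very dynamic: the values in that field can contain the comma delimiter anywhere. That is what has me stuck.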

2 answers:

Answer 0: (score: 0)

Try this:

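awk -F ',' '
   # seed the generator: without this, many awks (gawk, for one) repeat the same sequence on every run
   BEGIN { srand() }
   NR>1{
      if( ! ( $3 in UID ) ) {

         # draw again until the random id is not already taken
         do { Rnd=int(1000000*rand()) } while( Rnd in Used )

         Used[Rnd]=1
         UID[$3]=Rnd
         }
      # the website column can contain extra ",", so count the value field from the end of the record
      S[$3]+=$(NF - 3)
      }
    END {
       print "RandonUniqID,ID,Value"
       for( uid in UID ) printf( "%s,%s,%s\n", UID[uid], uid, S[uid])
       }
    ' YourFile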

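This assumes there are fewer than 1000000 distinct ids, since the random IDs are drawn from the range 0 to 999999.

Answer 1: (score: 0)

Like this: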

awk '# Set the input and output field delimiter and print the headers
     BEGIN{FS=OFS=",";print "RandomID,ID,Value"}
     # iteratively calculate the s(um) per id ($3) on each row
     NR>1{s[$3]+=$(NF-3)}
     # Print the results, indexed by an integer r
     END{for(i in s){print r++,i,s[i]}}' input_file

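NF is the number of fields, and $(NF-3) is the 4th field counting from the end. Counting from the end sidesteps the extra commas in the website column, because the fields after it keep a fixed distance from the end of the record.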
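To see why counting from the end works, here is a small sketch that prints the id, NF, and $(NF-3) for three of the sample rows from the question. The function name show_value_field is made up for this illustration and is not part of either answer.

```shell
# show_value_field is a hypothetical helper name, used only for this illustration.
# For every data row it prints the id, the number of fields, and the 4th field from the end.
show_value_field() {
    awk -F ',' 'NR>1 { print $3, NF, $(NF-3) }' "$@"
}

show_value_field <<'EOF'
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
EOF
```

The field count differs from row to row (14, 13, 12), yet $(NF-3) lands on the value each time (10, 20, 34).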

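This generates sequential IDs like the following (the row order comes from awk's for-in traversal, which is unspecified, so it may differ on your system):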

RandomID,ID,Value
0,3T45,57
1,2345,121
2,23A,130
3,AV45,110
4,AW3455,48
5,89A,20

If you need the ids to be 4 characters wide, you can use printf:

awk 'BEGIN{FS=",";print "RandomID,ID,Value"}
     NR>1{s[$3]+=$(NF-3)}
     END{for(i in s){printf "%04d,%s,%d\n",r++,i,s[i]}}' input_file

Output:

RandomID,ID,Value
0000,3T45,57
0001,2345,121
0002,23A,130
0003,AV45,110
0004,AW3455,48
0005,89A,20
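
If the rows should come out in the order the ids first appear in the input, with random rather than sequential 4-digit IDs as in the question's desired output, the two answers can be combined along these lines. This is only a sketch: the function name sum_by_id is hypothetical, and the 1000 to 9999 range is an assumption read off the sample output, which limits it to 9000 distinct ids.

```shell
# sum_by_id is a hypothetical helper name; the 1000-9999 range is an assumption
# taken from the 4-digit IDs in the desired output (so at most 9000 distinct ids).
sum_by_id() {
    awk -F ',' -v OFS=',' '
        BEGIN { srand(); print "RandomID,ID,Value" }
        NR > 1 {
            if (!($3 in sum)) order[++n] = $3   # remember first-seen order
            sum[$3] += $(NF-3)                  # value is the 4th field from the end
        }
        END {
            for (k = 1; k <= n; k++) {
                # redraw until the 4-digit number has not been used yet
                do { r = 1000 + int(9000 * rand()) } while (r in used)
                used[r] = 1
                print r, order[k], sum[order[k]]
            }
        }' "$@"
}

sum_by_id <<'EOF'
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0
EOF
```

The ID and Value columns should match the desired output in the question, row for row. The first column changes between runs, except when two runs start within the same second, because srand() with no argument seeds from the time of day.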