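I am having a really hard time with some Unix stuff, so any guidance here would be much appreciated.
I want to identify duplicate records in the file below based on id, assign each id a unique random number in a separate column, and sum up its value field. My input file: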
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0
I want my output file to be:
RandonUniqID,ID,Value
2202,23A,130
3029,89A,20
3066,AV45,110
5077,2345,121
1055,AW3455,48
3099,3T45,57
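Here I want to generate a unique random id for each record, and for the duplicated records I want their value fields summed into a single field. The trickiest part is that my 5th column, website, is very dynamic: a value in that field can contain the comma delimiter anywhere. That is what has me stuck.
Answer 0: (score: 0)
Try this: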
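awk -F ',' '
BEGIN { srand() }   # seed the generator; several awks otherwise repeat the same sequence every run
NR>1{
    if( ! ( $3 in UID ) ) {
        # draw a random number that has not been handed out yet
        # (Used is keyed by the random numbers, UID by the ids)
        while( (Rnd=int(1000000*rand())) in Used) ;
        Used[Rnd]=1
        UID[$3]=Rnd
    }
    # workaround: the website field (col 5) may contain ",",
    # so the value column is addressed from the end of the line
    S[$3]+=$(NF - 3)
}
END {
    print "RandonUniqID,ID,Value"
    for( id in UID ) printf( "%s,%s,%s\n", UID[id], id, S[id])
}
' YourFile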
This assumes there are fewer than 1000000 distinct ids.
Answer 1: (score: 0)
Like this:
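If you want 4-digit IDs like the ones in the desired output, something along these lines might work. It is only a sketch: it assumes there are fewer than 9000 distinct ids (otherwise the redraw loop cannot finish), it feeds the sample data from the question through a here-document instead of a file, and the array names uid, used and sum are just illustrative. The random numbers change from run to run, so only the ID and Value columns are predictable.

```shell
out=$(awk -F ',' '
BEGIN { srand() }                        # seed from the clock so runs differ
NR > 1 {
    if (!($3 in uid)) {
        # redraw until we get a 4-digit number that is not taken yet
        do { r = 1000 + int(9000 * rand()) } while (r in used)
        used[r] = 1
        uid[$3] = r
    }
    sum[$3] += $(NF - 3)                 # value is the 4th field from the end
}
END {
    print "RandonUniqID,ID,Value"
    for (id in uid) printf "%d,%s,%d\n", uid[id], id, sum[id]
}' <<'EOF'
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0
EOF
)
printf '%s\n' "$out"
```

Keeping a second array keyed by the random numbers is what makes the uniqueness check meaningful; checking against the array keyed by id would never detect a collision.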
awk '# Set the input and output field delimiter and print the headers
BEGIN{FS=OFS=",";print "RandomID,ID,Value"}
# iteratively calculate the s(um) per id ($3) on each row
NR>1{s[$3]+=$(NF-3)}
# Print the results, indexed by an integer r
END{for(i in s){print r++,i,s[i]}}' input_file
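NF is the number of fields, so $(NF-3) is the fourth field counted from the end of the line. Counting from the end is what sidesteps the extra commas inside the website field.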
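To see why counting from the end helps, here is a small check on two rows taken from the question, one with extra commas in the website field and one without. It prints the field count, the id, and the value picked by $(NF-3) for each row.

```shell
out=$(printf '%s\n' \
  'joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0' \
  'Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0' |
awk -F ',' '{ print NF, $3, $(NF-3) }')
printf '%s\n' "$out"
# prints:
# 14 23A 10
# 12 2345 34
```

The field count differs between the rows (14 versus 12), yet $3 and $(NF-3) still land on id and value, because the stray commas only occur between those two positions.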
This generates sequential IDs like the following:
RandomID,ID,Value
0,3T45,57
1,2345,121
2,23A,130
3,AV45,110
4,AW3455,48
5,89A,20
If you need IDs that are 4 characters wide, you can use printf:
awk 'BEGIN{FS=",";print "RandomID,ID,Value"}
NR>1{s[$3]+=$(NF-3)}
END{for(i in s){printf "%04d,%s,%d\n",r++,i,s[i]}}' input_file
Output:
RandomID,ID,Value
0000,3T45,57
0001,2345,121
0002,23A,130
0003,AV45,110
0004,AW3455,48
0005,89A,20
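One caveat: awk does not specify the order in which for (i in s) visits the keys, so which id gets which number can differ between awk implementations. If that matters, a possible variant (a sketch only, with an extra array called order that is not part of the answer above) records each id the first time it appears and numbers them in that order. The sample data is fed through a here-document here instead of input_file.

```shell
out=$(awk -F ',' '
NR > 1 {
    if (!($3 in s)) order[n++] = $3      # record each id the first time it appears
    s[$3] += $(NF - 3)                   # value is the 4th field from the end
}
END {
    print "RandomID,ID,Value"
    for (k = 0; k < n; k++) printf "%04d,%s,%d\n", k, order[k], s[order[k]]
}' <<'EOF'
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0
EOF
)
printf '%s\n' "$out"
```

With this variant the rows come out in the same order as the ids first appear in the input (23A, 89A, AV45, 2345, AW3455, 3T45), whichever awk is used.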