Originally, the file contained the following:
1.2.3.4: 1,3,4
1.2.3.5: 9,8,7,6
1.2.3.4: 4,5,6
1.2.3.6: 1,1,1
After my (incorrect) attempt at sorting, I have this:
1.2.3.4: 1,3,4,4,5,6,
1.2.3.5: 9,8,7,6,
1.2.3.6: 1,1,1,
I want to get it into the following format:
1.2.3.4: 1,3,4,5,6
1.2.3.5: 6,7,8,9
1.2.3.6: 1
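But how can I access each comma-separated value within each element and sort the values so that they are in ascending order with duplicates removed? So far, the only shell script I have managed to get working only accesses the element as a whole: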
awk -F' ' 'NF>1{a[$1] = a[$1]$2","}END{for(i in a){print i" "a[i] | "sort -t: -k1 "}}' c.txt
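Answer 0 (score: 3)
Edit: My first version took the intermediate data as input because the original data had not been posted yet, but the result can of course also be produced directly from the original data. Again with GNU awk: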
gawk -F '[ ,]' 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" } { for(i = 2; i <= NF; ++i) a[$1][$i]; } END { for(ip in a) { line = ip " "; for(n in a[ip]) { line = line n "," } sub(/,$/, "", line); print line } }' filename
The code works as follows:
BEGIN {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # GNU-specific: sorted array
                                             # traversal, numerically ascending
}
{
    for(i = 2; i <= NF; ++i) a[$1][$i]       # remember numbers by IP; repeated
                                             # numbers collapse into one index
}
END {                                        # in the end:
    for(ip in a) {                           # for all IPs:
        line = ip " "                        # construct the line: IP
        for(n in a[ip]) {                    # numbers in order
            line = line n ","
        }
        sub(/,$/, "", line)                  # remove trailing comma
        print line                           # print the result.
    }
}
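For comparison, the same "collect, deduplicate, sort, rejoin" idea can be sketched without GNU-specific features by letting sort do the ordering. This is only an illustrative alternative, not part of the answer above: the function name merge_sorted is made up, and it assumes a POSIX awk and sort and that the IP field contains no spaces.

```shell
# Hypothetical helper: merge_sorted FILE
# Prints one "ip: n1,n2,..." line per IP, numbers unique and ascending.
merge_sorted() {
    # 1. Explode every line into "ip number" pairs, skipping empty fields
    #    (which appear when a line ends with a trailing comma).
    awk -F '[ ,]' '{ for (i = 2; i <= NF; ++i) if ($i != "") print $1, $i }' "$1" |
        # 2. Sort by IP (as a string), then by number (numerically);
        #    -u drops pairs that compare equal on both keys.
        LC_ALL=C sort -u -k1,1 -k2,2n |
        # 3. Glue consecutive pairs with the same IP back into one line.
        awk '
            $1 != ip { if (NR > 1) printf "\n"; ip = $1; printf "%s %s", ip, $2; next }
                     { printf ",%s", $2 }
            END      { if (NR > 0) printf "\n" }'
}
```

Calling merge_sorted c.txt on the original file from the question should give the three lines the question asks for. The trade-off is two extra processes in exchange for not depending on gawk's arrays of arrays or PROCINFO["sorted_in"].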
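With GNU awk, assuming the data is formatted exactly like the intermediate data in the question (each line ends with a trailing comma):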
gawk -F '[ ,]' 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" } { delete a; for(i = 2; i < NF; ++i) a[$i]; line = $1 " "; for(i in a) line = line i ","; sub(/,$/, "", line); print line; }' filename
Each line is split into fields on spaces and commas; the code then works as follows:
BEGIN {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # GNU-specific: sorted array
                                             # traversal, numerically ascending
}
{
    delete a                                 # start fresh for every line
    for(i = 2; i < NF; ++i) { a[$i] }        # remember the fields in a line.
                                             # duplicates are removed here.
                                             # note that it's < NF instead of
                                             # <= NF because the trailing comma
                                             # leaves us with an empty last
                                             # field.
    line = $1 " "                            # start building line: IP field
    for(i in a) {                            # append numbers separated by
        line = line i ","                    # commas
    }
    sub(/,$/, "", line)                      # remove last trailing comma
    print line                               # print result.
}
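To see why the loop stops at NF - 1, it can help to print the fields exactly as awk splits them. The following is a small sketch for inspection only; show_fields is a made-up helper name, and it works with any POSIX awk.

```shell
# Hypothetical helper: show_fields LINE
# Prints every field of LINE as awk sees it with -F '[ ,]', one per line,
# in the form index=[value].
show_fields() {
    printf '%s\n' "$1" |
        awk -F '[ ,]' '{ for (i = 1; i <= NF; ++i) printf "%d=[%s]\n", i, $i }'
}
```

For example, show_fields '1.2.3.5: 9,8,7,6,' lists six fields, and the sixth is empty because nothing follows the final comma. Looping up to NF would store that empty string as an extra array index, which is why the code above uses i < NF.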