I have the following file:
SOME TEXT AT START OF FILE
STRING1 SMALL
STRING2 SMALL
STRING1 MEDIUM
STRING3 LARGE
STRING2 XLG
SOME TEXT TO SEPARATE LISTS
STRING4 SMALL
STRING1 MEDIUM
STRING1 SMALL
STRING5 LARGE
STRING6 SMALL
SOME MORE TEXT TO SEPARATE LISTS
ANOTHER LIST
...
For each list, I want to keep only the largest size (S, M, L, XL) of each string, so that the result looks like this:
SOME TEXT AT START OF FILE
STRING1 MEDIUM
STRING3 LARGE
STRING2 XLG
SOME TEXT TO SEPARATE LISTS
STRING4 SMALL
STRING1 MEDIUM
STRING5 LARGE
STRING6 SMALL
SOME MORE TEXT TO SEPARATE LISTS
ANOTHER LIST
...
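I have no idea how to do this. Please help. I am trying to do this in a bash script from the Terminal on a Mac.
I also need to modify another, similar list: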
TEXT
STRING1
STRING2
STRING3
STRING1
TEXT
STRING4
STRING1
TEXT
STRING5
STRING2
STRING5
ETC...
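In this case, how do I eliminate the duplicate strings? I was going to try awk '!seen[$0]++' filename, but that removes duplicates across all lists instead of looking at each list separately.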
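To make the problem concrete, here is a minimal sketch of what the whole-file version does to the sample above. The function name dedupe_global is only an illustrative label and is not part of the original command.

```shell
#!/bin/bash
# dedupe_global is a hypothetical wrapper name used only for this illustration.
# It prints each distinct line once for the WHOLE input; nothing is reset between lists.
dedupe_global() {
    awk '!seen[$0]++'
}

printf '%s\n' TEXT STRING1 STRING2 STRING3 STRING1 \
              TEXT STRING4 STRING1 \
              TEXT STRING5 STRING2 STRING5 | dedupe_global
```

This prints TEXT, STRING1, STRING2, STRING3, STRING4, STRING5, one per line. The second and third TEXT separators and the STRING1 in the second list are dropped as well, which is why the seen array has to be reset for each list.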
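Answer 0 (score: 1)
First problem (the script treats any line that does not begin with a space as a separator, so it assumes the list items are indented and the separator lines are not):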
$ cat tst.awk
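BEGIN {
    # Rank the sizes so they can be compared numerically.
    sz["SMALL"]  = 0
    sz["MEDIUM"] = 1
    sz["LARGE"]  = 2
    sz["XLG"]    = 3
}
# A line that does not start with a space is a separator:
# flush the list collected so far, then print the separator itself.
/^[^ ]/ {
    dump()
    delete data
    print
    next
}
# Store the size if the string is new or this size outranks the stored one.
!($1 in data) || sz[data[$1]] < sz[$2] {
    data[$1] = $2
}
END {
    dump()
}
# k is a local variable. The order of "for (k in data)" is unspecified,
# so items may come out in a different order than they went in.
function dump(    k) {
    for (k in data)
        print " " k " " data[k]
}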
$
$ awk -f tst.awk file
SOME TEXT AT START OF FILE
 STRING1 MEDIUM
 STRING2 XLG
 STRING3 LARGE
SOME TEXT TO SEPARATE LISTS
 STRING4 SMALL
 STRING5 LARGE
 STRING6 SMALL
 STRING1 MEDIUM
SOME MORE TEXT TO SEPARATE LISTS
ANOTHER LIST
...
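Note that the order of the items within each list is not preserved by the script above (in the sample output, STRING1 MEDIUM comes last in the second list), because awk does not specify the order of a for-in loop. If the order matters, one possible variant is sketched below. It buffers each list and prints the winning line for every string at its original position. The function name keep_largest and the array names are hypothetical, it assumes list items start with a space while separator lines do not, and on a tie it keeps the first occurrence.

```shell
#!/bin/bash
# keep_largest is a hypothetical name; it reads the file on stdin.
# Assumption: list items start with a space, separator lines do not.
keep_largest() {
    awk '
    BEGIN { sz["SMALL"] = 0; sz["MEDIUM"] = 1; sz["LARGE"] = 2; sz["XLG"] = 3 }

    # Print the buffered lines that hold the largest size for their string,
    # in input order, then reset the per-list state.
    function emit(    i) {
        for (i = 1; i <= n; i++)
            if (best[key[i]] == i)
                print line[i]
        delete best
        n = 0
    }

    # Separator line: finish the current list and pass the separator through.
    /^[^ ]/ { emit(); print; next }

    # List item: buffer it and remember which line has the largest size per string.
    {
        n++
        line[n] = $0; key[n] = $1; size[n] = $2
        if (!($1 in best) || sz[size[best[$1]]] < sz[$2])
            best[$1] = n
    }

    END { emit() }
    '
}
```

It could be called as keep_largest < file > file.new. Because the original lines are printed unchanged, whatever indentation the input uses is kept.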
Second problem:
awk '/^[^ ]/{delete seen}!seen[$0]++' file
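Spelled out, the one-liner above consists of two rules. The sketch below is the same logic with comments, again assuming that list items start with a space and separator lines do not. The wrapper name dedupe_per_list is hypothetical.

```shell
#!/bin/bash
# dedupe_per_list is a hypothetical wrapper name; it reads the file on stdin.
dedupe_per_list() {
    awk '
    # A line that does not start with a space begins a new list:
    # forget every line seen so far.
    /^[^ ]/ { delete seen }

    # Print a line only the first time it appears since the last reset.
    # The separator itself is printed because seen was just cleared.
    !seen[$0]++
    '
}
```

It could be called as dedupe_per_list < file. A string that occurs in two different lists is kept in both, while a repeat inside the same list is dropped.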