awk: removing duplicate lines where word positions can be swapped

Time: 2017-03-10 23:25:05

Tags: awk

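In general, my question is how to use AWK to remove duplicate lines from a file, where "duplicate" includes the case in which certain columns are swapped.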

Background of my question. Initially I had a file like this:

10/13-01:55:42.549318  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 10.0.0.3:1045 -> 103.105.0.1:80
10/13-01:55:42.549318  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 103.105.0.1:80 -> 10.0.0.3:1045
10/13-01:56:45.221877  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 10.0.0.3:1049 -> 103.105.0.1:80
10/13-01:56:57.150985  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 10.0.0.3:1051 -> 103.105.0.1:80
10/13-01:56:58.935176  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 10.0.0.3:1051 -> 103.105.0.1:80
10/13-01:57:13.494148  [**] [1:1000003:0] Detect possible CnC comu [**] [Classification: Misc activity] [Priority: 3] {TCP} 10.0.0.3:1054 -> 103.105.0.1:80

My goal is to end up with the following formatted file:

10.0.0.3|1045|103.105.0.1|80|CnC
10.0.0.3|1049|103.105.0.1|80|CnC
10.0.0.3|1051|103.105.0.1|80|CnC
10.0.0.3|1054|103.105.0.1|80|CnC

Effort and progress so far. I used the following (very badly written) pipeline to process it:

cat test.log | awk -F" " '{print $6 " " $15 " " $17}' | awk '{t = $1; $1 = $2; $2 = $3; $3 = t; print;}' | awk '{gsub(":", "|"); gsub(" ","|"); print}' | awk 'NR%2!=0'

That gives me a file containing lines like these:

10.0.0.3|1045|103.105.0.1|80|CnC
10.0.0.3|1049|103.105.0.1|80|CnC
10.0.0.3|1051|103.105.0.1|80|CnC
10.0.0.3|1051|103.105.0.1|80|CnC
10.0.0.3|1054|103.105.0.1|80|CnC
103.105.0.1|80|10.0.0.3|1045|CnC

The first and last lines are considered duplicates because they match the following pattern:

A|a|B|b|M
B|b|A|a|M
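
For the already formatted file, one possible approach (a minimal sketch, not part of the original post) is to build a key that does not depend on which endpoint comes first: order the two `ip|port` pairs as strings and keep a line only the first time its key appears. The function name `dedup_swapped` and the array name `seen` are made up for illustration.

```shell
# Hypothetical helper: drop lines that repeat an earlier line, treating
# A|a|B|b|M and B|b|A|a|M as the same record. Reads stdin or file arguments.
dedup_swapped() {
    awk -F'|' '
    {
        left  = $1 FS $2          # first  "ip|port" pair, as a string
        right = $3 FS $4          # second "ip|port" pair, as a string
        # Put the pairs in a fixed (string-sorted) order so that both
        # orientations of a record produce the same key.
        key = (left < right) ? left SUBSEP right : right SUBSEP left
        key = key SUBSEP $5
    }
    !seen[key]++                  # print only the first occurrence of a key
    ' "$@"
}

# Demo on the sample from the question
printf '%s\n' \
    '10.0.0.3|1045|103.105.0.1|80|CnC' \
    '10.0.0.3|1049|103.105.0.1|80|CnC' \
    '10.0.0.3|1051|103.105.0.1|80|CnC' \
    '10.0.0.3|1051|103.105.0.1|80|CnC' \
    '10.0.0.3|1054|103.105.0.1|80|CnC' \
    '103.105.0.1|80|10.0.0.3|1045|CnC' | dedup_swapped
```

The line that survives is whichever orientation appears first in the input; the sorted order is used only for the lookup key, not for the output.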

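Request for help. Is there any way, using AWK, to remove these duplicate lines from a relatively large file in the original format, without needing my post-processing? Thanks!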

1 answer:

Answer 0: (score: 0)

Perhaps you can skip this step entirely and just process the raw data:

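#!/usr/bin/awk -f

# Join output fields with "|"
BEGIN{ OFS = "|" }

# For every line, grab the two endpoints on either side of the "->" arrow
{
    ip1 = $(NF-2)   # source "ip:port"
    ip2 = $NF       # destination "ip:port"
}

# Run only if this pair has not been seen yet in either direction
!(key1[ip1,ip2] + key1[ip2,ip1]){

    split(ip1,combo1,":")
    split(ip2,combo2,":")

    # Mark both orderings as seen
    key1[ip1,ip2]++
    key1[ip2,ip1]++

    # $6 is the "CnC" token of the alert message in the sample log
    print combo1[1],combo1[2],combo2[1],combo2[2],$6
}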
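
A variant of the same idea, offered as a sketch rather than a replacement for the script above: for a relatively large file it may be enough to store one order-independent key per connection instead of two array entries. The names `dedup_flows` and `seen` are hypothetical. Like the script above, it assumes the endpoints are the third-from-last and last fields and that the label is field 6, which holds for the sample log but depends on the alert message text.

```shell
# Hypothetical wrapper around a single awk pass over the raw alert log.
# Usage (test.log is the file name used in the question): dedup_flows test.log
dedup_flows() {
    awk 'BEGIN { OFS = "|" }
    {
        src = $(NF-2)             # source "ip:port"
        dst = $NF                 # destination "ip:port"
        # Force a string comparison and store the smaller endpoint first,
        # so A -> B and B -> A map to the same key.
        key = ((src "") < (dst "")) ? src SUBSEP dst : dst SUBSEP src
    }
    !seen[key]++ {                # first time this connection is seen
        split(src, s, ":")
        split(dst, d, ":")
        print s[1], s[2], d[1], d[2], $6
    }' "$@"
}
```

The output keeps the orientation of the first line seen for each connection, which is the same behavior as the two-entry version.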