Question

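I have a .txt file into which many Snort alerts are written. I want to search this file, remove the duplicate alerts, and keep only one copy of each. So far I am using the following code: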

import re

with open('SnortReportFinal', 'r') as f:
    file_lines = f.readlines()

# Collect the index of every line that contains an alert ID such as 1:368:6
cont_lines = []
for line in range(len(file_lines)):
    if re.search(r'\d:\d+:\d+', file_lines[line]):
        cont_lines.append(line)

for idx in cont_lines[1:]:  # skip one instance of the string
    file_lines[idx] = ""  # replace all others

with open('SnortReportFinal', 'w') as f:
    f.writelines(file_lines)

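The regular expression matches the string I am searching for, e.g. 1:234:5. If it finds multiple instances of the same string, I want it to delete them and keep only one. This does not work: every other matching line is deleted, and only a single line that matches the expression is kept.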

The file contains text like this:

[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]
[1:368:6] ICMP PING BSDtype [**]

The numbers in the [1:368:6] part can vary, e.g. [1:5476:5].

My expected output is just:

[1:368:6] ICMP PING BSDtype [**]
[1:563:2] ICMP PING BSDtype [**]

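with the remaining strings deleted. By "remaining" I mean that lines whose numbers differ are fine to keep, but lines that repeat the same numbers are not.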

Answer 1

It looks like you don't really need a regular expression. To simply remove the duplicates:

alerts = set(f.readlines())

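This converts the list of lines in the file into a set, which removes the duplicates. From there, you can write the set straight back to the text file.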
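As a rough sketch of the full round trip (read, de-duplicate, write back): note that a plain set does not remember the order of the lines, so this version swaps in dict.fromkeys, which keeps the first occurrence of each line in its original position on Python 3.7+. The helper name dedupe_file and the temporary file are made up for the illustration.

```python
import tempfile
from pathlib import Path

def dedupe_file(path):
    """Rewrite the file at `path`, keeping only the first copy of each line."""
    with open(path, "r") as f:
        # dict.fromkeys keeps one key per distinct line, in first-seen order
        unique_lines = list(dict.fromkeys(f))
    with open(path, "w") as f:
        f.writelines(unique_lines)
    return unique_lines

# Demonstration on a throwaway file (hypothetical content based on the question).
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "SnortReportFinal"
    report.write_text(
        "[1:368:6] ICMP PING BSDtype [**]\n"
        "[1:368:6] ICMP PING BSDtype [**]\n"
        "[1:563:2] ICMP PING BSDtype [**]\n"
    )
    dedupe_file(report)
    result = report.read_text()
    print(result, end="")
```

If the order of the alerts does not matter, writing the set itself with f.writelines(alerts) should work just as well.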

Alternatively, you can call set directly on the file object, as Padraic Cunningham pointed out in the comments:

alerts = set(f)
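One caveat worth checking, shown here as a small sketch: lines read from a file keep their line endings, so if the last line of the file has no trailing newline it will not compare equal to an otherwise identical earlier line. The io.StringIO object below merely stands in for a real file, and its content is invented for the example.

```python
import io

# Stand-in for a file whose last line lacks a trailing newline (assumed content).
f = io.StringIO(
    "[1:368:6] ICMP PING BSDtype [**]\n"
    "[1:368:6] ICMP PING BSDtype [**]"
)

raw = set(f)  # each element keeps whatever line ending it had

f.seek(0)
stripped = set(line.rstrip("\n") for line in f)  # compare without line endings

print(len(raw), len(stripped))
```

Stripping the line ending before building the set avoids the problem; remember to add the newline back when writing the lines out.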

Answer 2

You don't need a regex; you can just use a set:

seen=set(i.strip() for i in open('infile.txt'))

Example:

>>> s="""[1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:368:6] ICMP PING BSDtype [**]
... [1:563:2] ICMP PING BSDtype [**]"""
>>> set(s.split('\n'))
set(['[1:563:2] ICMP PING BSDtype [**]', '[1:368:6] ICMP PING BSDtype [**]'])
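If the lines for one alert can differ in the rest of their text and only the bracketed ID should decide what counts as a duplicate, a regex can still be useful as the de-duplication key. The following is only a sketch under that assumption: the pattern assumes IDs shaped like [1:368:6], and the names ALERT_ID and dedupe_by_alert_id are invented for the illustration.

```python
import re

# Assumed ID shape: three colon-separated numbers in brackets, e.g. "[1:368:6]".
ALERT_ID = re.compile(r"\[(\d+:\d+:\d+)\]")

def dedupe_by_alert_id(lines):
    """Keep the first line seen for each alert ID; keep lines without an ID untouched."""
    seen = set()
    kept = []
    for line in lines:
        match = ALERT_ID.search(line)
        if match is None:
            kept.append(line)  # no alert ID on this line, so leave it alone
            continue
        alert_id = match.group(1)
        if alert_id not in seen:
            seen.add(alert_id)
            kept.append(line)
    return kept

sample = [
    "[1:368:6] ICMP PING BSDtype [**]",
    "[1:368:6] ICMP PING BSDtype [**]",
    "[1:563:2] ICMP PING BSDtype [**]",
    "[1:368:6] ICMP PING BSDtype [**]",
]
deduped = dedupe_by_alert_id(sample)
print("\n".join(deduped))
```

Unlike the code in the question, which blanks every matching line after the first regardless of its ID, this tracks each ID separately, so one line per distinct ID survives.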

Removing duplicate strings in Python using a regular expression
