Question

我是编码的初学者。我有一个脚本，该脚本将对文件中的特定单词进行grep。我需要计算它的出现次数，但是我正在寻找的单词是级联的。因此，如果我在不到1分钟的时间内再次重复，我想忽略其发生。但如果发生的设备与第一次出现的设备不同，则不应忽略
例如：file1.txt

file .txt    2018.09.06 21:27:45.001 There is a error 12345
file .txt    2018.09.06 21:27:45.009 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:45.500 There is a error 12345
file .txt    2018.09.06 21:27:45.601 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:46.899 There is a error 12345
file .txt    2018.09.06 21:27:46.905 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:49.203 There is a error 12345
file .txt    2018.09.06 21:27:49.491 error 12345 is reported on device-6
file .txt    2018.09.06 21:27:52.703 There is a error 12345
file .txt    2018.09.06 21:29:52.991 error 12345 is reported on device-6

结果为

grep -c 12345 file1.txt
10

我得到的结果= 10

我需要的结果= 3

我如何忽略基于时间戳的重复出现。

Answer 1

您对“彼此之间1分钟之内”的部分有多大兴趣？如果说“忽略特定分钟内的多次出现”就足够了，那就很简单了。

首先获取所有“已报告错误xyz”行的列表

$ grep "error 12345 is reported" tfile.txt
file .txt    2018.09.06 21:27:45.009 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:45.601 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:46.905 error 12345 is reported on device-1
file .txt    2018.09.06 21:27:49.491 error 12345 is reported on device-6
file .txt    2018.09.06 21:29:52.991 error 12345 is reported on device-6

然后使用sed将其减少为HH:MM device-number格式

$ grep reported tfile.txt | sed 's/.*\(..:..\):.*reported on \(.*\)/\1 \2/'
21:27 device-1
21:27 device-1
21:27 device-1
21:27 device-6
21:29 device-6

然后找到唯一条目

$ grep reported tfile.txt | sed 's/.*\(..:..\):.*reported on \(.*\)/\1 \2/' | uniq
21:27 device-1
21:27 device-6
21:29 device-6

最后算一下

$ grep reported tfile.txt | sed 's/.*\(..:..\):.*reported on \(.*\)/\1 \2/' | uniq | wc -l
3

Answer 2

您需要解析时间戳，进行一些日期时间数学运算（总是有些困难，但是至少您可能没有越过时区，尽管您可能会定期越过夏令时，但我不这样做）认为您没有足够的信息可以告诉您何时发生）。这可能意味着一个简单的grep就不够用了，您需要阅读bash中的每一行，进行解析并跟踪。

然后，您将需要sed或awk之类的内容来解析该行。只要您仍然这样做，就可能只需要对整个过程使用awk。我从未使用过awk来进行时间戳管理，尽管我看到了它的手册页，所以它应该能够处理。其余的将根据时间戳跟踪设备名称，awk对此进行了很好的管理。

但是，我建议为此使用更高级的语言。无论是perl，python，ruby，它们都应该能够轻松应对。

计算一个字符串的实例，不包括时间戳小于1分钟的情况

2 个答案: