I have a large log file.
After removing the timestamp from each line, I sorted it with
cat logfile | sort -u > logfile
so that the log is clean and looks like this:
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
.
. (lines not shown here)
.
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
.
.
. (lines not shown here)
.
.
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
I can get the logged items (for example, PL.HSPB in the example above) with
grep -oE " [0-9A-Z]*\.[0-9A-Z]*" logfile | sort -u
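For comparison, roughly the same extraction can be sketched in Python. This is only an illustration: the function name logged_items is made up, and the pattern uses + instead of * so that empty name parts are not matched, which is a small deviation from the grep expression.

```python
import re


def logged_items(lines):
    """Return the sorted, distinct item prefixes (e.g. "PL.HSPB") found in lines."""
    # A space, then uppercase letters/digits, a dot, and more letters/digits.
    pattern = re.compile(r" ([0-9A-Z]+\.[0-9A-Z]+)")
    items = set()
    for line in lines:
        items.update(pattern.findall(line))
    return sorted(items)
```

Unlike the grep output, the leading space is not part of the returned strings.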
However, I also want to keep the date information, and to make the log cleaner I want to delete the intermediate lines. For example,
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
becomes, after the deletion,
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
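That is, for each item, keep only the first and the last line (the numbers are the year and the Julian day).
Is there a shell command that can do this easily?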
Answer 0 (score: 0)
Script:
$ cat hhz.py
#!/usr/bin/env python3
import re
import sys
from collections import OrderedDict

# Lines are grouped by their text with the date removed. These map each
# group to its first and last line, in order of first appearance.
firsts = OrderedDict()
lasts = OrderedDict()

for line in sys.stdin:
    line = line.rstrip("\n")
    # Split into: text up to "HHZ.", the "YYYY.DDD" date, and the rest.
    x = re.match(r"(.*HHZ\.)([0-9]{4}\.[0-9]+)( .*)", line)
    if x is None:
        continue  # skip lines that carry no date
    before = x.group(1)
    after = x.group(3)
    undated = before + after
    if undated not in firsts:
        firsts[undated] = line
    lasts[undated] = line

for undated in firsts:
    first = firsts[undated]
    last = lasts[undated]
    print(first)
    if first != last:
        print(last)
Input:
$ cat hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
Output:
$ hhz.py < hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
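undated is the line with the date removed; it serves as the key that groups lines together.
OrderedDict preserves the order of the input file (use a plain dict if you do not need that; note that since Python 3.7 a plain dict also preserves insertion order).
The first != last check avoids printing the same line twice when a group contains only one line.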
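
The same idea can also be packaged as a function, which makes it easier to test without piping a file through stdin. This is a minimal sketch, assuming Python 3.7 or later (where a plain dict keeps insertion order); the name first_and_last and the compiled pattern DATED are illustrative and not part of the script above.

```python
import re

# Matches "<prefix>HHZ.<year>.<day><rest>"; the date is captured separately
# so that it can be dropped from the grouping key.
DATED = re.compile(r"(.*HHZ\.)(\d{4}\.\d+)( .*)")


def first_and_last(lines):
    """Return the first and last line of each group, in input order."""
    groups = {}  # undated key -> [first line, last line]
    for line in lines:
        line = line.rstrip("\n")
        m = DATED.match(line)
        if m is None:
            continue  # skip lines without an HHZ date
        key = m.group(1) + m.group(3)
        if key in groups:
            groups[key][1] = line  # update the last line seen so far
        else:
            groups[key] = [line, line]
    result = []
    for first, last in groups.values():
        result.append(first)
        if first != last:  # a single-line group is emitted only once
            result.append(last)
    return result
```

It could be called as first_and_last(open("hhz.dat")) and the returned lines printed one per line. Like the script, it relies on the input already being sorted so that the first and last line of a group carry the earliest and latest date.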