Removing some lines from a log file

Time: 2016-08-25 14:38:53

Tags: shell logging text text-processing

I have a large log file.

After removing the timestamp from each line, I sorted it with cat logfile | sort -u > logfile so that the log is clean and organized by item:

failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
.
. (lines not shown here)
.
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
.
.
. (lines not shown here)
.
.
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

I can list the logged items (for example, PL.HSPB in the sample above) with:

grep -oE " [0-9A-Z]*\.[0-9A-Z]*" logfile | sort -u

However, I also want to see the date information, and to make the log clearer I want to delete the intermediate lines. For example,

failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

After deletion:

failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

That is, for each item, keep only the first and last lines (the numbers are the year and the Julian day).
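
To make the date fields concrete, here is a minimal Python sketch. It assumes the trailing YYYY.DDD pair is a year followed by a day-of-year count (1 to 365 or 366), which is how "Julian day" is read here; the helper name parse_year_day is hypothetical.

```python
from datetime import datetime


def parse_year_day(stamp):
    """Convert 'YYYY.DDD' (year and day of year) to a datetime.date.

    Assumption: DDD counts days from January 1 of that year.
    """
    return datetime.strptime(stamp, "%Y.%j").date()


print(parse_year_day("2011.128"))  # 2011-05-08
print(parse_year_day("2014.365"))  # 2014-12-31
```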

Is there a shell command that can do this easily?

1 answer:

Answer 0: (score: 0)

Script:

$ cat hhz.py
#!/usr/bin/env python

import sys, re
from collections import OrderedDict

# Map each "undated" line (the line with its date removed) to the first
# and last full lines seen for it; OrderedDict keeps input order.
firsts = OrderedDict()
lasts  = OrderedDict()

# Split a line into the text up to "HHZ.", the YYYY.DDD date, and the rest.
pattern = re.compile(r"(.*HHZ\.)([0-9]{4}\.[0-9]+)( .*)")

for line in sys.stdin:
    line = line.rstrip("\n")

    x = pattern.match(line)
    if x is None:
        continue  # skip lines that carry no date

    # The grouping key is the line with the date part dropped.
    undated = x.group(1) + x.group(3)

    if undated not in firsts:
        firsts[undated] = line
    lasts[undated] = line

for undated in firsts:
    first = firsts[undated]
    last  = lasts[undated]
    print(first)
    if first != last:
        print(last)

Input:

$ cat hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format

Output:

$ hhz.py < hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
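
  • Groups lines by stripping out the date part; undated is the line with the date removed.
  • Records the first line of each group by storing it in an ordered dict only if the key is not already set.
  • Records the last line of each group by storing it in an ordered dict unconditionally.
  • Uses OrderedDict to preserve the input file's ordering (use a plain dict if you don't need that).
  • Checks first != last to avoid printing the same line twice when a group contains only one item.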
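
The steps above can also be sketched as a reusable function. This is only an illustration, not part of the original answer: the name first_and_last and the DATE pattern are hypothetical, the pattern assumes the date is a ".YYYY.DDD" suffix followed by whitespace or end of line (so it is not tied to the HHZ channel), and a plain dict is used on the assumption of Python 3.7 or later, where dicts preserve insertion order.

```python
import re

# Assumed date layout: ".YYYY.DDD" at the end of the identifier,
# followed by whitespace or the end of the line.
DATE = re.compile(r"\.\d{4}\.\d{3}(?=\s|$)")


def first_and_last(lines):
    """Return the first and last line of each group, in input order.

    Two lines belong to the same group when they are identical once the
    date part has been removed.
    """
    groups = {}  # undated key -> [first_line, last_line]
    for line in lines:
        line = line.rstrip("\n")
        m = DATE.search(line)
        if m is None:
            continue  # skip lines without a date
        key = line[:m.start()] + line[m.end():]
        if key in groups:
            groups[key][1] = line      # overwrite the last line seen
        else:
            groups[key] = [line, line]  # first occurrence of this group

    result = []
    for first, last in groups.values():
        result.append(first)
        if first != last:  # a single-line group is emitted only once
            result.append(last)
    return result
```

To use it as a filter, one could loop over first_and_last(sys.stdin) and print each returned line, which should give the same output as the script above for the sample input.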