Bash script to filter logs

Date: 2016-05-31 01:49:51

Tags: linux bash logging awk scripting

I am trying to create a script that filters out duplicate entries in a log and keeps only the latest occurrence of each message. A sample is below:

May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
May 29 22:25:19 servername.com Fdm: this is just a message
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The log is split across two files. I have already started on a script that merges the two files and sorts them by date using sort -s -r -k1.

I have also made the script prompt for the date I want and then filter by that date with grep.

Now I just need a way to remove duplicate lines that are not adjacent to each other and that also have different timestamps. I tried awk, but my knowledge of awk is not that good. Can any awk experts out there help me?

PS: one of the problems I keep running into is that there are identical lines with different error codes. I want to remove those duplicates, but the only way I have found is grep -v "constant part of the line". It would be great if there were a way to remove duplicates by percentage of similarity. Also, I cannot simply make the script ignore certain fields or columns, because the error codes appear in different fields/columns on different lines.

The expected output is as follows:

May 29 22:25:30 servername.com Fdm: another error message 3 76543
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890

I only want the errors, but that is easy enough with grep -i error. The only remaining problem is the duplicate lines with different error codes.

5 answers:

Answer 0 (score: 1)

To remove identical lines that differ only in their timestamps, you can simply check for duplicates starting from the 15th character onward:

awk '!duplicates[substr($0,15)]++' $filename

If your logs were tab-delimited, you could be more precise about which columns determine a duplicate, which would be better than trying to compute the Levenshtein distance between log entries.
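
For instance, if the logs were (hypothetically) tab-delimited with the message text in a known column, say the 5th, a sketch like the following would de-duplicate on that column alone (the column number is an assumption for illustration):

awk -F'\t' '!duplicates[$5]++' $filename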

Answer 1 (score: 1)

You can do this with sort alone.

Just operate on the fields starting from field 4 to catch the duplicates:

sort -uk4 file.txt

This gives you the first entry for each duplicate; if you want the last one instead, pipe the file through tac first:

tac file.txt | sort -uk4 

Example:

$ cat file.txt      
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores

$ sort -uk4 file.txt
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList

$ tac file.txt | sort -uk4         
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList

Answer 2 (score: 1)

You can always skip the first 3 fields and remove duplicates with sort -suk4. The first 3 fields are the date string, so any two lines with the same text after them will be collapsed into one. You can then re-sort the output however you like:

sort -suk4 filename | sort -rs

Removing the lines that differ only in their error codes is trickier, but I would suggest isolating the lines that contain an error code into their own file and then using something like this:

sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/' errorfile | sort -suk5 | sed 's/\([0-9]*\) \(.*error code=\)/\2\1/'
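
For illustration, assuming a hypothetical errorfile holding the two "message 2" lines from the sample, the first sed moves the numeric code to the front of each line, so everything from field 5 onward becomes identical and sort -suk5 keeps only one of them; the final sed then moves the code back into place:

$ sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/' errorfile
12345 May 29 22:25:19 servername.com Fdm: error code= message 2
34567 May 29 22:25:30 servername.com Fdm: error code= message 2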

Answer 3 (score: 0)

You did not tell us how you define "duplicate", but if you mean messages with the same timestamp (the first three fields), this would do it:

$ tac file | awk '!seen[$1,$2,$3]++' | tac
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: another error message 3 76543

If that is not what you mean, just change the index used in the awk array to whatever you want to consider when testing for duplicates.
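
For example, to treat two lines as duplicates whenever everything from the hostname onward matches, ignoring the timestamp entirely, a sketch like this would work (it assumes the hostname is always the 4th whitespace-separated field):

tac file | awk '!seen[substr($0, index($0,$4))]++' | tac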

Given your recent comments, this might be what you want:

$ tac file | awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k)} !seen[k]++' | tac
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The above works by creating a key value, k, from each line: the part after the first ':' that is not part of the time field, with every sequence of digits changed to '#'. You can see the key generated for each line here:

$ awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k); print $0 ORS "\t -> key =", k}' file
May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
         -> key =  this is error message # error code=#x#
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
         -> key =  error code=# message #
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
         -> key =  this is error message # error code=#x#
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
         -> key =  error code=# message #
May 29 22:25:30 servername.com Fdm: another error message 3 76543
         -> key =  another error message # #

Answer 4 (score: 0)

I managed to find a way. Just to give you guys more details about the problem I had and what this script does.

Problem: I had logs that needed to be cleaned up, but the logs contained many duplicated error lines. Unfortunately, the duplicated errors had different error codes, so I could not simply grep -v them. Also, the logs ran to tens of thousands of lines, so grep -v'ing them one by one would have taken a huge amount of time, so I decided to semi-automate it with a script. The script is below. If you have ideas on how to improve it, please comment!

#!/usr/local/bin/bash

# Clean up leftovers from a previous run
rm /tmp/tmp.log /tmp/tmpfiltered.log 2> /dev/null

# Merge the two log files into one working file
printf "Please key in the full locations of both logs: "
read log1loc log2loc
cat $log1loc $log2loc >> /tmp/tmp.log

# Sort the merged log by date, newest first
sort -s -r -k1 /tmp/tmp.log -o /tmp/tmp.log

# Ask for the date to filter on
printf "Please key in the date: "
read logdate

# Repeatedly show the remaining error lines for that date, keep the newest one,
# then strip that line (and its variations) out of the working file
while [[ $firstlineedit != "n" ]]
do
        grep -e "$logdate" /tmp/tmp.log | grep -i error | less

        # Save the most recent line to the filtered output
        firstline=$(head -n 1 /tmp/tmp.log)
        head -n 1 /tmp/tmp.log >> /tmp/tmpfiltered.log

        # Let the user trim the line down to its constant part (or enter n to quit)
        read -p "Enter line to remove (enter n to quit): " -e -i "$firstline" firstlineedit

        # Count how often that pattern occurs, then remove all of its occurrences
        firstlinecount=$(grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -o "$firstlineedit" | wc -l)
        grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -v "$firstlineedit" > /tmp/tmp2.log
        mv /tmp/tmp2.log /tmp/tmp.log

        if [ "$firstlineedit" != "n" ]; then
                echo "That line and its variations have appeared $firstlinecount times in the log!"
        fi
done

# Show the de-duplicated error lines
cat /tmp/tmpfiltered.log | less