Suppose I have a log file that looks like this:
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
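What I want to do is this: for the lines containing "Matched Line", I need to take the unique ID in column 4 (for example, 06z07mjBYxFpzs) and then output only the lines that carry one of those IDs and contain "Some Data xxyyzz" (not the "Matched Line" lines themselves). So in this case the output should be: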
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
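The file in question is huge (about 200 GB, with millions of records) and lives on a shared server, so I cannot run scripts or commands that take a lot of resources or time.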
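[Edit] - At the moment I do this by writing the unique IDs of the Matched Line lines to one file and the Some Data xxyyzz lines to another, then running fgrep between them. What I am looking for is a single-line grep, awk or sed command that does not need multiple intermediate files for fgrep.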
[Edit 2] - The input here is not in a file; it is the intermediate output of a series of grep and sort commands.
[Edit 3] - Updated sample input (not in order, but shuffled):
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Answer 0 (score: 3)
The following makes only a single pass through the file, so it should be fast:
$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
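In the sample input (original question), each wanted Some Data line comes after its Matched Line and before any Matched Line with a different ID. That is what makes this quick and simple solution possible.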
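awk works well in pipelines. If the input comes from a pipeline rather than a file, as described in Edit 2, use something like the following (cmd1, cmd2 and cmd3 are placeholders):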
cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3
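/Matched Line/{id=$4;next;}
Whenever we find a line containing the text Matched Line, we save its ID in the variable id. Since we do not want to print the Matched Line lines themselves, we tell awk to skip the remaining commands and jump to the next line.
id==$4
If the current line's ID (field 4) matches the saved id, the line is printed.
(In awk terms, id==$4 is a condition: it evaluates to true or false. When the condition is true, the action is performed. Here we specified no action, so awk performs the default action, which is to print the line.)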
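The same logic can be written out in a longer, commented form. This is only a sketch of the one-liner explained above; the function name filter_after_match is made up for illustration, and it reads from standard input so that it fits into a pipeline.

```shell
# Hypothetical helper (the name is an assumption): reads log lines on stdin and
# prints those whose 4th field equals the ID of the most recent "Matched Line".
filter_after_match() {
    awk '
        /Matched Line/ {   # remember the ID of this marker line
            id = $4
            next           # do not print the marker line itself
        }
        id == $4           # no action given, so awk prints the line
    '
}

# Usage with three sample lines taken from the question:
printf '%s\n' \
    '06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line' \
    '06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz' \
    '06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz' |
    filter_after_match
```

Only the second sample line carries the ID saved from the marker line, so it is the only one printed.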
In Edit 3, a data line can appear at some arbitrary position after its matched line. In that case:
$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Or, in a pipeline:
cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
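One detail that may matter for a 200 GB input: in awk, merely referencing id[$4] creates an empty array element for that key, so the array ends up holding every distinct ID seen, not only the matched ones. A possible variant, offered as a sketch, tests membership with the in operator instead, which does not create elements. The function name filter_seen_ids and the array name seen are made up for illustration.

```shell
# Hypothetical helper (the name is an assumption): reads log lines on stdin and
# prints data lines whose ID appeared on an earlier "Matched Line".
filter_seen_ids() {
    awk '
        /Matched Line/ { seen[$4] = 1; next }   # record the ID, skip the marker line
        ($4 in seen)                            # membership test, creates no new elements
    '
}

# Usage with shuffled sample lines from Edit 3:
printf '%s\n' \
    '06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line' \
    '06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line' \
    '06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz' \
    '06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz' \
    '06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz' |
    filter_seen_ids
```

As with the original command, a data line that arrives before its Matched Line is not printed, because its ID has not been recorded yet.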
Answer 1 (score: 1)
grep "Matched Line" data.txt | awk '{print $4}' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
Output:
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Alternatively, using bash's process substitution, we can reduce the number of times the file data.txt has to be read:
grep -f <(grep "Matched Line" data.txt | awk '{print $4}') data.txt | grep -v "Matched Line"
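A possible refinement of the command above, offered as a sketch: let awk do both the matching and the field extraction, remove duplicate IDs with sort -u, and pass -F so that grep treats the IDs as fixed strings rather than regular expressions. This version feeds the ID list to grep through -f - on standard input, which GNU grep documents; other grep implementations may need the process-substitution form instead. The function name data_lines_for_matched_ids is made up for illustration.

```shell
# Hypothetical helper (the name is an assumption).
# $1: path to the log file. Prints the non-marker lines that contain
# any ID found in column 4 of a "Matched Line".
data_lines_for_matched_ids() {
    awk '/Matched Line/ { print $4 }' "$1" |   # first read: collect the IDs
        sort -u |                              # each ID only once
        grep -F -f - "$1" |                    # second read: fixed-string lookup
        grep -v 'Matched Line'                 # drop the marker lines
}

# Usage:
#   data_lines_for_matched_ids data.txt
```

Like the original command, this matches an ID anywhere on the line rather than only in column 4, and it still reads the file twice, so it needs a real file rather than a pipe.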