GREP for a dynamic pattern in a file and print the other lines that have the previous pattern plus another pattern

Time: 2015-07-21 06:06:00

Tags: regex awk sed grep text-extraction

Suppose I have a log file that looks like this:

06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

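What I want to do is:

  • For any line that contains "Matched Line", get the unique ID in column 4 (e.g. 06z07mjBYxFpzs),
  • search for the other lines that have that unique ID plus the text "Some Data xxyyzz",
  • and print the lines that match that pattern (unique ID + "Some Data xxyyzz") to the console as the final output.

So in this case the output should be: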

06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

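The file I am talking about is huge (roughly 200 GB, with millions of records) and sits on a shared server, so I cannot run scripts or commands that take a lot of resources or time.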

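[Edit] - Currently I do this by printing the unique IDs of the Matched Line records into one file and the Some Data xxyyzz lines into another file, then running fgrep; but I am looking for a one-line grep/awk/sed command (without having to create multiple files for fgrep).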
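
For reference, the two-file approach described in this edit might look roughly like the sketch below. The exact commands are an assumption, since the question does not show them, and the file names (input.log, ids.txt, data_lines.txt) are hypothetical; grep -F is the current spelling of fgrep.

```shell
# Hypothetical working directory holding a three-line excerpt of the log.
work=$(mktemp -d)
cat > "$work/input.log" <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# File 1: the unique IDs (field 4) of all "Matched Line" records.
grep 'Matched Line' "$work/input.log" | awk '{print $4}' | sort -u > "$work/ids.txt"

# File 2: all "Some Data xxyyzz" records.
grep 'Some Data xxyyzz' "$work/input.log" > "$work/data_lines.txt"

# Keep only the data records that contain one of the saved IDs.
result=$(grep -F -f "$work/ids.txt" "$work/data_lines.txt")
printf '%s\n' "$result"
```

The intermediate files are exactly what a one-line solution would have to avoid.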

[Edit 2] - The input is not in a file; it is the intermediate output of a series of grep and sort commands.

[Edit 3] - Updated sample input (not in sequence, but jumbled):

06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

2 Answers:

Answer 0: (score: 3)

Ordered data

The following makes only a single pass through the file, so it should be fast:

$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz

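In the sample input (of the original question), every Some Data line that we want comes after its Matched Line and before a Matched Line with a different ID. That is what makes this fast and simple solution possible.

How to use it in a pipeline

awk works well in pipelines. If the input comes not from a file but from a pipeline, as in Edit 2, use something like: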

cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3

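How it works

  • /Matched Line/{id=$4;next;}

    Whenever we find a line that contains the text Matched Line, we save its ID in the variable id. Since we do not want to print the Matched Line itself, we tell awk to skip the remaining commands and jump to the next line.

  • id==$4

    If the current line's ID (field 4) matches the saved id, the line is printed.

    (In awk terms, id==$4 is a condition: it evaluates to true or false. When the condition is true, the action is performed. Here we specified no action, so awk performs the default action, which is to print the line.)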
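
The two rules above can also be written as a longer, commented awk program with the default print action spelled out. This is only a sketch that is meant to behave like the one-liner; the sample file is a hypothetical three-line excerpt of the question's log, created just for the demonstration.

```shell
# Hypothetical sample file made of three lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
EOF

# Same logic as the one-liner, one statement per line.
result=$(awk '
    /Matched Line/ {   # a "Matched Line" record:
        id = $4        #   remember its ID (field 4)
        next           #   and move on without printing it
    }
    id == $4 {         # any other record whose ID equals the remembered one
        print          #   is printed in full
    }
' "$sample")

printf '%s\n' "$result"
```

Only the second line of the sample is printed: the third line carries a different ID, so the condition id == $4 is false for it.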

Partially ordered data

In Edit 3, a data line can appear at any position after its Matched Line. In that case:

$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Or, in a pipeline:

cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
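
One point worth keeping in mind for a 200 GB input: in awk, merely referencing id[$4] creates an empty array element for any ID that is not yet in the array, so the array grows with every distinct ID in the stream, not only the matched ones. A sketch of a variant that avoids this uses the in operator, which tests membership without creating an element. The additional /Some Data xxyyzz/ test is an extra filter assumed from the question's wording, and the sample file is hypothetical.

```shell
# Hypothetical sample: the data line for 06z07nMgPJpPv1 is not adjacent to its Matched Line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
EOF

result=$(awk '
    /Matched Line/ { id[$4] = 1; next }   # record the ID and print nothing
    /Some Data xxyyzz/ && ($4 in id)      # membership test only; the default action prints
' "$sample")

printf '%s\n' "$result"
```

Like the version above, this still requires each data line to come after its Matched Line in the stream.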

Answer 1: (score: 1)

grep "Matched Line" data.txt | awk '{print $4}' | xargs -I {} grep {} data.txt | grep -v "Matched Line"
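  1. Search for all "Matched Line" lines
  2. Print the 4th field of each of those lines to stdout
  3. For each line of that output, run grep to search for the printed ID
  4. Filter again, this time dropping the "Matched Line" lines
  5. Output: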

    06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
    06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
    06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
    06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
    06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
    06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
    06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
    06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 
    

    Or, using bash's process substitution, we can reduce the number of times the file data.txt has to be read:

    grep -f <(grep "Matched Line" data.txt  | awk '{print $4}') data.txt | grep -v "Matched Line"
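
Since the IDs are plain strings, a variant with grep -F (fixed-string matching, which is what fgrep does) avoids interpreting them as regular expressions. The sketch below is a variation under stated assumptions, not part of the original answer: it reads the pattern list from standard input with -f -, which GNU grep supports, and it uses sort -u so that each ID is listed only once. The sample file is hypothetical.

```shell
# Hypothetical sample: the ID 06z07mdgC66vHc has two Matched Line records.
sample=$(mktemp)
cat > "$sample" <<'EOF'
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
EOF

# Collect the unique IDs, then use them as fixed-string patterns read from stdin.
result=$(grep 'Matched Line' "$sample" \
    | awk '{print $4}' \
    | sort -u \
    | grep -F -f - "$sample" \
    | grep -v 'Matched Line')

printf '%s\n' "$result"
```

The file is still read twice, and the full list of IDs is held in memory by the second grep, so this may or may not be acceptable for an input of the size described in the question.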