Question

I want to grep/sed a file to get all the lines from the first match of pattern 1 to the last match of pattern 2. For example:

[aaa] text1
[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc
[ddd] text8
[ddd] text9

Pattern 1: bbb. Pattern 2: ccc. Output:

[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc

Using sed -n -e '/bbb/,/ccc/{ p; }', I was able to retrieve the output from the first match (pattern 1) to the first match (pattern 2), which stops short of "text7".

Edit: I need this solution to be reasonably fast, since it has to work with large (multi-GB) files.

Answer 1

Someone may come up with a one-liner, but I've got this:

#!/bin/bash
#
start=$(grep -n bbb data | head -1 | cut -d':' -f1)
end=$(grep -n ccc data | tail -1 | cut -d':' -f1)

sed -n "${start},${end}p" data

Get the start line number, get the end line number, and print everything between those two numbers.
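The same three steps can be sketched as a reusable function. This is only an illustration: the name `extract_range` and its argument order are made up here, and the early `q` is an addition so that sed stops reading once the end line has been printed, which may matter on multi-GB files.

```shell
#!/bin/sh
# extract_range FILE START_PATTERN END_PATTERN   (hypothetical helper name)
# Prints the lines from the first START_PATTERN match to the last END_PATTERN match.
extract_range() {
    file=$1
    # Line number of the first line matching the start pattern
    start=$(grep -n -e "$2" "$file" | head -n 1 | cut -d: -f1)
    # Line number of the last line matching the end pattern
    end=$(grep -n -e "$3" "$file" | tail -n 1 | cut -d: -f1)

    # Nothing to print if a pattern is missing or the range is empty
    [ -n "$start" ] && [ -n "$end" ] && [ "$start" -le "$end" ] || return 1

    # Print the range, then quit so sed does not scan the rest of the file
    sed -n "${start},${end}p;${end}q" "$file"
}
```

It would be called as `extract_range data bbb ccc`. Note that the file is still read up to three times (twice by grep, once partially by sed).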

Answer 2

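You've already got a sed solution that works. A more "efficient" sed solution would need to use an unknown amount of memory as a buffer, which may be problematic depending on your data and system.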

Another possibility is to use awk. The following should work in most versions of awk:

awk 'NR==FNR && $1~/bbb/ && !a { a=NR } NR==FNR && $1~/ccc/ { b=NR } NR==FNR {next} FNR >= a && FNR <= b' file.txt file.txt

Broken out for easier reading and commenting:

# If we're reading first file, and we see our start pattern,
# and we haven't seen it before, set "a" as our start record.
NR==FNR && $1~/bbb/ && !a { a=NR }

# If we're reading the first file, and we see our end pattern,
# set "b" as our end record.
NR==FNR && $1~/ccc/ { b=NR }

# If we're in the first file, move on to the next line.
NR==FNR {next}

# Now that we're in the second file...  If the current line is
# between (or inclusive of) our start/end records, print the line.
FNR >= a && FNR <= b

While this does read the file twice, it does not store large amounts of data in memory.
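As a rough sketch of the same two-pass idea with the patterns passed in as parameters: the function name `two_pass_range` is hypothetical, the patterns are matched against the whole line (`$0`) rather than `$1`, and an `exit` is added so the second pass stops after the last end match. Because `awk -v` interprets backslash escapes, patterns containing backslashes would need extra care.

```shell
#!/bin/sh
# two_pass_range FILE START_REGEX END_REGEX   (hypothetical helper name)
two_pass_range() {
    awk -v s="$2" -v e="$3" '
        # Pass 1: remember the first start match and the latest end match
        NR == FNR {
            if (!a && $0 ~ s) a = FNR
            if ($0 ~ e)       b = FNR
            next
        }
        # Pass 2: past the last end match, stop reading
        FNR > b { exit }
        # Pass 2: print once the first start match is reached
        a && FNR >= a
    ' "$1" "$1"
}
```

For example, `two_pass_range file.txt bbb ccc`. Since the file is named twice on the command line, this needs a regular file and cannot read from a pipe.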

Answer 3

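This uses awk with a buffer to hold the lines between ccc matches. If there is a huge gap between two occurrences of ccc, you may run into memory problems.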

$ awk 's{buf=buf?buf RS $0:$0; if(/ccc/){print buf; buf=""} next}
       /bbb/{f=1} f; /ccc/{s=1}' ip.txt
[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc

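/bbb/{f=1} f; /ccc/{s=1} prints the lines from the first occurrence of bbb to the first occurrence of ccc. It also sets s at the first occurrence of ccc.

Once s is set:

- buf=buf?buf RS $0:$0; accumulates lines in the buffer
- if(/ccc/){print buf; buf=""} prints the buffer contents and then clears the buffer if the line contains ccc
- next skips the rest of the code, since it is no longer needed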

You can also use:

awk 'f || /bbb/{buf=buf?buf RS $0:$0; if(/ccc/){print buf; buf=""} f=1}' ip.txt
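A variant of the same buffering idea, as a sketch only: it reads standard input so it can sit at the end of a pipe, takes the patterns as arguments, and keeps the pending lines in an array with a counter instead of one concatenated string, so an empty line at the start of the buffer is kept. The name `stream_range` is made up for illustration, and the memory caveat above applies unchanged.

```shell
#!/bin/sh
# stream_range START_REGEX END_REGEX   (hypothetical helper name; reads stdin)
stream_range() {
    awk -v s="$1" -v e="$2" '
        # Start buffering at the first start match
        !f && $0 ~ s { f = 1 }
        f {
            buf[++n] = $0
            # Each end match proves the buffered lines belong in the output
            if ($0 ~ e) {
                for (i = 1; i <= n; i++) print buf[i]
                split("", buf)   # portable way to empty the array
                n = 0
            }
        }
    '
}
```

For example, `stream_range bbb ccc < ip.txt`. Lines after the last end match stay in the buffer and are discarded when the input ends.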

Answer 4

The OP asked me to post my Perl solution in case it helps others.

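It scans the input file only once. It needs, at most, twice the disk space of the input file (the input file plus the result, if the whole input file lies between the start and end markers). I decided to buffer on disk because, if the file is very large, memory might not be big enough.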

Here is the code:

#!/usr/bin/perl -w
#
################################################################################

use strict;

my($inputfile);
my($outputfile);
my($bufferfile) = "/tmp/bufferfile.tmp";
my($startpattern);
my($endpattern);

#################################################
# Subroutines
#################################################
sub show_usage
{
    print("Takes 4 arguments:\n");
    print("   1) the name of the file to process.\n");
    print("   2) the name of the output file.\n");
    print("   3) the start pattern.\n");
    print("   4) the end pattern.\n");
    exit;
}

sub close_outfiles
{
    close(OUTPUTFILE);
    close(BUFFERFILE);
}

sub cat_buffer_to_output
{
    # Open outputfile in append mode
    open(OUTPUTFILE,">>","$outputfile") or die "ERROR: could not open outputfile $outputfile (append mode)!";
    # Open bufferfile in read mode
    open(BUFFERFILE,"$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (read mode)!";
    # Dump the content of the buffer to the output
    print OUTPUTFILE while <BUFFERFILE>;
    close_outfiles();
    # Reopen the bufferfile, with > to truncate it
    open(BUFFERFILE,">","$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (write mode)!";
}

#################################################
# Main
#################################################

# Manage arguments
if (@ARGV != 4)
{
    show_usage();
}
else
{
    $inputfile = $ARGV[0];
    $outputfile = $ARGV[1];
    $startpattern = $ARGV[2];
    $endpattern = $ARGV[3];
}

# Open the files, the first time
open(INPUTFILE,"$inputfile") or die "ERROR: could not open inputfile $inputfile (read mode)!";
open(OUTPUTFILE,">","$outputfile") or die "ERROR: could not open outputfile $outputfile (write mode)!";
open(BUFFERFILE,">","$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (write mode)!";

my($sendtobuffer) = 0;

while (<INPUTFILE>)
{
    # The first time I see the start pattern, start sending lines to the buffer
    if (!$sendtobuffer && $_ =~ /$startpattern/)
    {
        $sendtobuffer = 1;
    }

    # Once the start pattern was seen at least once, print to BUFFERFILE.
    # End-pattern lines before the first start pattern are ignored.
    if ($sendtobuffer)
    {
        print BUFFERFILE;
        # If I see the endpattern, empty the buffer file into the output file
        if ($_ =~ /$endpattern/)
        {
            cat_buffer_to_output();
        }
    }
}

# Close files
close(INPUTFILE);
close_outfiles();

# cleanup
unlink($bufferfile);

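Basically, it reads through the input file. When it sees the start pattern for the first time, it starts writing lines to the buffer file. When it sees the end pattern, it dumps the contents of the buffer file into the output file and truncates the buffer file. Since it continues until the end of the input, it dumps the buffer file into the output file every time it sees the end pattern.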
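The same spill-to-disk technique can be sketched in a few lines of shell and awk. This is not the Perl program above, only an illustration of the idea: the function name `disk_buffered_range` is hypothetical, `mktemp` is assumed to be available, and the result goes to standard output rather than to a named output file.

```shell
#!/bin/sh
# disk_buffered_range START_REGEX END_REGEX   (hypothetical helper name)
# Reads stdin, writes the selected lines to stdout, buffers pending lines on disk.
disk_buffered_range() {
    tmp=$(mktemp) || return 1
    awk -v s="$1" -v e="$2" -v tmp="$tmp" '
        # Start buffering at the first start match
        !f && $0 ~ s { f = 1 }
        f {
            # After a close(), this redirection reopens and truncates the file
            print > tmp
            if ($0 ~ e) {
                # Flush: finish writing, then copy the buffer file to stdout
                close(tmp)
                while ((getline line < tmp) > 0) print line
                close(tmp)
            }
        }
    '
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

For example, `disk_buffered_range bbb ccc < inputfile > outputfile`. Whatever is left in the buffer file after the last end match is simply deleted with it.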

Answer 5

With the same memory caveat as Sundeep's answer, you can also use this sed command:

sed -n '/bbb/,/ccc/p;/ccc/!b;:A;N;/\n.*ccc/!bA;s/[^\n]*\n//;p;s/.*//;bA' infile
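For readers who want to follow the one-liner step by step, here is the same script spread over several lines with comments. It is a sketch that assumes GNU sed (for `\n` inside a bracket expression); the wrapper name `sed_first_to_last` is made up, and the patterns bbb and ccc are fixed as in the original.

```shell
#!/bin/sh
# sed_first_to_last   (hypothetical wrapper name; reads stdin, assumes GNU sed)
sed_first_to_last() {
    sed -n '
        # Print the range from the first bbb to the first ccc
        /bbb/,/ccc/p
        # Any line without ccc is done; start the next cycle
        /ccc/!b
        # From the first ccc on, collect following lines in the pattern space
        :A
        N
        # Keep collecting until an appended line contains ccc
        /\n.*ccc/!bA
        # Drop the first line of the pattern space (already printed, or empty)
        s/[^\n]*\n//
        # Print the collected lines, which end at the newest ccc
        p
        # Empty the pattern space and go back to collecting
        s/.*//
        bA
    '
}
```

For example, `sed_first_to_last < infile`. The lines collected after the last ccc are never printed, because with `-n` sed exits silently when `N` runs out of input.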

Lines from first match (pattern 1) to last match (pattern 2)
