Extracting text between similar marker lines

Time: 2011-09-15 18:39:47

Tags: perl shell grep

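I need to extract the lines of text that sit between marker lines and load them into an Excel file. The number of lines between markers varies, but every block starts with a line of the form: Comment for the record "idno" ... followed by the other text.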

__DATA__ (This is what my .txt file looks like)
Comment for the record "id1"
Attempt1 made on [time] outcome [outcome]
note 1

Comment for the record "id2"
Attempt1 made on [time] outcome [outcome]
note 1
Attempt2 made on [time] outcome [outcome]
note 2

Comment for the record "id3"
Attempt1 made on [time] outcome [outcome]
note 1
Attempt2 made on [time] outcome [outcome]
note 2
Attempt3 made on [time] outcome [outcome]
note 3
Attempt4 made on [time] outcome [outcome]
note 4

I would like it displayed as:

id1     Attempt1   Note1 [outcome]
id2     Attempt1   Note1 [outcome]
id2     Attempt2   Note2 [outcome]
id3     Attempt1   Note1 [outcome]
id3     Attempt2   Note2 [outcome]
id3     Attempt3   Note3 [outcome]
id3     Attempt4   Note4 [outcome]

The outcome value varies and will be a 2-3 digit code.

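Any help is greatly appreciated. I have searched this site for the last day or two, but with my limited experience I could not find anything relevant, and since I am fairly new to Perl and shell I thought it best to post this as a question.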

Kind regards, Ace

6 answers:

Answer 0 (score: 2)

Using GNU awk (needed for the regex capture groups that the three-argument match() provides):

gawk '
    /^$/ {next}
    match($0, /Comment for the record "([^"]*)/, a) {id = a[1]; next}
    match($0, /(.+) made on .* outcome (.+)/, a) {att = a[1]; out = a[2]; next}
    {printf("%s\t%s\t%s\t%s\n", id, att, $0, out)}
'

Or, translated to Perl:

perl -lne '
    # -l already chomps each input line, so no explicit chomp is needed
    next if /^$/;
    if (/Comment for the record "([^"]*)/) {$id = $1; next;}
    if (/(.+) made on .* outcome (.+)/) {$att = $1; $out = $2; next;}
    print join("\t", $id, $att, $_, $out);
'
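
If gawk is not available, roughly the same logic can be written without the three-argument match(). The following is a minimal sketch, assuming the input has exactly the layout shown in the question and that the outcome is whatever follows "outcome " on the Attempt line; the function name extract_rows is made up for illustration.

```shell
# extract_rows: hypothetical helper; reads the log (stdin or file arguments)
# and writes tab-separated id / attempt / note / outcome rows.
extract_rows() {
    awk '
        /^[[:space:]]*$/ { next }                      # skip blank separator lines
        /^Comment for the record "/ {
            id = $0
            sub(/^Comment for the record "/, "", id)   # drop everything before the id
            sub(/".*$/, "", id)                        # drop the closing quote and the rest
            next
        }
        / made on .* outcome / {
            att = $1                                   # first word, e.g. Attempt1
            out = $0
            sub(/^.* outcome /, "", out)               # keep what follows "outcome "
            next
        }
        { printf "%s\t%s\t%s\t%s\n", id, att, $0, out }  # any other line is the note
    ' "$@"
}
```

Usage would be something like extract_rows data.txt > rows.tsv, where data.txt and rows.tsv are placeholder file names.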

Answer 1 (score: 2)

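Your data fits a paragraph-oriented parsing strategy well. Because your specification is vague, it is hard to know exactly which regular expressions are needed, but this should illustrate the general approach: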

use strict;
use warnings;

# Paragraph mode: read the input file a paragraph/block at a time.
local $/ = "";

while (my $block = <>){
    # Convert the block to lines.
    my @lines = grep /\S/, split("\n", $block);

    # Parse the text, capturing the needed items from @lines as we consume it.
    # Note also the technique of assigning regex captures directly to variables.
    my ($id) = shift(@lines) =~ /"(.+)"/;
    while (@lines){
        my ($attempt, $outcome) = shift(@lines) =~ /(Attempt\d+).+outcome (\d+)/;
        my $note = shift @lines;
        print join("\t", $id, $attempt, $note, $outcome), "\n";
    }
}
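
The same paragraph-oriented idea can also be sketched in awk, where an empty RS switches on paragraph mode and FS = "\n" makes each line of a block a field. This is only an illustration: it assumes every block is a Comment line followed by Attempt/note pairs and that the outcome is the last word of the Attempt line; rows_by_paragraph is a hypothetical name.

```shell
# rows_by_paragraph: hypothetical helper; one awk record per blank-line-separated block.
rows_by_paragraph() {
    awk 'BEGIN { RS = ""; FS = "\n"; OFS = "\t" }
    {
        # $1 is the "Comment for the record" line; the id sits between the quotes
        split($1, q, "\"")
        id = q[2]
        # Remaining fields come in pairs: an Attempt line, then its note
        for (i = 2; i + 1 <= NF; i += 2) {
            n = split($i, w, " ")
            print id, w[1], $(i + 1), w[n]   # w[1] = AttemptN, w[n] = last word = outcome
        }
    }' "$@"
}
```

Any leading line that is not part of the data (such as the __DATA__ line in the question) would need to be removed first, since the sketch treats the first line of every block as the Comment line.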

Answer 2 (score: 1)

I think you are looking for something like this. It prints CSV, which can be opened in Excel.

use strict;

local $/;

block(/(id\d+)/,$_) for split /\n\n/, <DATA>;

sub block {
  my ($id,$block) = @_;

  $block =~ s/.*?(?=Attempt)//s;

  print join(',', $id, /(Attempt\d+)/, /([^\n]+)$/, /outcome (\d+)/)."\n"
    for split /(?=Attempt)/, $block
  ;
}
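
One caveat with plain join(',') output: a note that itself contains a comma or a double quote would shift the columns in Excel. A possible way around it, sketched here in awk rather than Perl, is to quote every field and double any embedded quotes (the convention described in RFC 4180). The sketch assumes tab-separated rows on input, such as those produced by the other answers; rows_to_csv is a made-up name.

```shell
# rows_to_csv: hypothetical helper; converts tab-separated rows to quoted CSV.
rows_to_csv() {
    awk -F '\t' '
        # Quote one field: double embedded quotes, then wrap in quotes
        function csv(s) {
            gsub(/"/, "\"\"", s)
            return "\"" s "\""
        }
        {
            line = csv($1)
            for (i = 2; i <= NF; i++) line = line "," csv($i)
            print line
        }
    ' "$@"
}
```

Quoting every field is more than strictly necessary, but it keeps the function short and avoids deciding field by field whether quoting is needed.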

Answer 3 (score: 1)

Unless I'm missing something, this looks pretty straightforward:

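  • You are looking for lines that start with Comment. Each of these contains your ID.
  • Once you have an ID, you will see an Attempt line followed by a note line. Read the attempt, then the line after it, which contains the note.
  • When you reach the next Comment line, rinse and repeat.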

So we have a specific structure: each ID has an array of attempts, and each attempt contains an outcome and a note.

I'm going to use object-oriented Perl here. I'll put all the record IDs into a list called @dataList, and each item in this list is of type Id.

Each Id object contains an array of attempts, and each Attempt has an Id, a Time, an Outcome, and a Note.

#! /usr/bin/perl
# test.pl

use strict;
use warnings;
use feature qw(say);

########################################################################
# READ IN AND PARSE YOUR DATA
#

my @dataList;

my $record;
while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /^Comment for the record "(.*)"/) {
        my $id = $1;
        $record = Id->new($id);
        push @dataList, $record;
    }
    elsif ($line =~ /^(\S+)\s+made on\s(\S+)\soutcome\s(.*)/) {
        my $attemptId = $1;
        my $time = $2;
        my $outcome = $3;

        # Next line is the note

        chomp (my $note = <DATA>);
        my $attempt = Attempt->new($attemptId, $time, $outcome, $note);
        $record->PushAttempt($attempt);
    }
}

foreach my $id (@dataList) {
    foreach my $attempt ($id->Attempt) {
        print $id->Id . "\t";
        print $attempt->Id . "\t";
        print $attempt->Note . "\t";
        print $attempt->Outcome . "\n";
    }
}
#
########################################################################


########################################################################
# PACKAGE Id;
#
package Id;
use Carp;

sub new {
    my $class       = shift;
    my $id  = shift;

    my $self = {};

    bless $self, $class;

    $self->Id($id);

    return $self;
}

sub Id {
    my $self = shift;
    my $id   = shift;

    if (defined $id) {
        $self->{ID} = $id;
    }

    return $self->{ID};
}

sub PushAttempt {
    my $self        = shift;
    my $attempt = shift;

    if (not defined $attempt) {
        croak qq(Missing Attempt in call to Id->PushAttempt);
    }
    if (not exists ${$self}{ATTEMPT}) {
        $self->{ATTEMPT} = [];
    }
    push @{$self->{ATTEMPT}}, $attempt;

    return $attempt;
}

sub PopAttempt {
    my $self = shift;

    return pop @{$self->{ATTEMPT}};
}

sub Attempt {
    my $self = shift;
    return @{ $self->{ATTEMPT} || [] };   # empty list if no attempts were pushed
}


#
########################################################################

########################################################################
# PACKAGE Attempt
#
package Attempt;

sub new {
    my $class       = shift;
    my $id  = shift;
    my $time        = shift;
    my $outcome = shift;   # order matches the call Attempt->new($attemptId, $time, $outcome, $note)
    my $note    = shift;

    my $self = {};
    bless $self, $class;

    $self->Id($id);
    $self->Time($time);
    $self->Note($note);
    $self->Outcome($outcome);

    return $self;
}

sub Id {
    my $self = shift;
    my $id   = shift;


    if (defined $id) {
        $self->{ID} = $id;
    }

    return $self->{ID};
}

sub Time {
    my $self = shift;
    my $time = shift;

    if (defined $time) {
        $self->{TIME} = $time;
    }

    return $self->{TIME};
}

sub Note {
    my $self = shift;
    my $note = shift;

    if (defined $note) {
        $self->{NOTE} = $note;
    }

    return $self->{NOTE};
}

sub Outcome {
    my $self        = shift;
    my $outcome = shift;

    if (defined $outcome) {
        $self->{OUTCOME} = $outcome;
    }

    return $self->{OUTCOME};
}
#
########################################################################

package main;

__DATA__
Comment for the record "id1"
Attempt1 made on [time] outcome [outcome11]
note 11

Comment for the record "id2"
Attempt21 made on [time] outcome [outcome21]
note 21
Attempt22 made on [time] outcome [outcome22]
note 22

Comment for the record "id3"
Attempt31 made on [time] outcome [outcome31]
note 31
Attempt32 made on [time] outcome [outcome32]
note 32
Attempt33 made on [time] outcome [outcome33]
note 33
Attempt34 made on [time] outcome [outcome34]
note 34

Answer 4 (score: 0)

This may not be very robust, but here is a fun attempt with sed:

sed -r -n 's/Comment for the record "([^"]+)"$/\1/;tgo;bnormal;:go {h;n;};:normal /^Attempt[0-9]/{s/(.+) made on .* outcome (.+)$/\1 \2/;G;s/\n/ /;s/(.+) (.+) (.+)/\3\t\1\t\2/;N;s/\t([^\t]+)\n(.+)/\t\2\t\1/;p;d;}' data.txt

Note: GNU sed only. It could easily be made portable if needed.

Answer 5 (score: 0)

An awk one-liner based on your example:

kent$ awk 'NF==5{gsub(/\"/,"",$5);id=$5;next;} /^Attempt/{n=$1;gsub(/Attempt/,"Note",n);print id,$1,n,$6}' input
id1 Attempt1 Note1 [outcome]
id2 Attempt1 Note1 [outcome]
id2 Attempt2 Note2 [outcome]
id3 Attempt1 Note1 [outcome]
id3 Attempt2 Note2 [outcome]
id3 Attempt3 Note3 [outcome]
id3 Attempt4 Note4 [outcome]
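
Note that the one-liner above builds the Note column from the Attempt name and never reads the note line itself, and NF==5 would also match any five-word note. If the real note text is wanted, a variation along these lines might work. It is a sketch assuming each Attempt line is immediately followed by its note and that the outcome is the last word on the Attempt line; the name rows_with_real_notes is hypothetical.

```shell
# rows_with_real_notes: hypothetical helper; prints id, attempt, note text, outcome.
rows_with_real_notes() {
    awk '
        /^Comment for the record/ { id = $NF; gsub(/"/, "", id); next }
        /^Attempt/ {
            att = $1; out = $NF              # first word and last word of the Attempt line
            if ((getline note) > 0)          # read the following line as the note text
                print id "\t" att "\t" note "\t" out
        }
    ' "$@"
}
```

Matching on the text of the Comment line rather than on the field count means a note of any length is passed through unchanged.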