Perl and parsing messy text

Date: 2010-11-05 22:21:03

Tags: perl parsing unpack

I have the following text:

                         Instructor First                          Number Students Who   Number Students Who
Subject Course Section                      Instructor Last Name                                               A    B C       D F
                         Name                                      Completed the Class   Dropped the Class
ACCT    201    01        Karin              Hatheway Dial          56                    6                     19   9    16   2   5
ACCT    202    01        Karin              Hatheway Dial          69                    11                    37   14   7    2   6
ACCT    205    01        Darryl             Woolley                20                    1                     3    7    6    1   3
ACCT    205    02        Darryl             Woolley                28                    1                     6    7    13       2
ACCT    205    03        Darryl             Woolley                42                    5                     4    13   21   1   3
ACCT    205    04        Darryl             Woolley                23                    1                     9    5    8    1
ACCT    205    05        Darryl             Woolley                30                    2                     11   7    9    2   1
ACCT    205    06        Darryl             Woolley                25                    3                     8    9    6    1   1
ACCT    275    01        Darryl             Woolley                33                    2                     7    15   9    1   1
ACCT    310    01        Marla              Kraut                  16                    1                     1    6    7    2
ACCT    310    02        Marla              Kraut                  64                                          5    43   15   1
ACCT    310    03        Marla              Kraut                  72                    3                     11   47   10   3   1
ACCT    311    01        Karin              Hatheway Dial          45                                          13   20   11   1
ACCT    311    02        Karin              Hatheway Dial          25                                          10   12   3
ACCT    315    01        Jason              Porter                 26                                          6    5    8    6   1
ACCT    315    02        Jason              Porter                 29                    1                     6    10   5    7   1
ACCT    414    01        Teresa             Gordon                 22                    1                     6    6    9    1
ACCT    483    01        Glen               Utzman                 26                    1                     7    13   6
ACCT    486    01        Teresa             Gordon                 33                                          13   14   6
ACCT    492    01        Jason              Wills                  23                                          5    8    9    1
ACCT    515    01        Jeffrey            Harkins                15                                          7    6    1
ACCT    561    01        Jason              Porter                 18                    1                     10   7    1
ADOL    526    13        Charles            Gagel                  21                    2                     19   1             1
ADOL    573    13        Martha             Yopp                   28                                          16   3             1
ADOL    574    01        Laura              Holyoke                16                                          12   3             1
ADOL    574    11        Laura              Holyoke                9                     1                     8    1
ADOL    574    13        Laura              Holyoke                15                                          10   4             1
ADOL    600    13        Roger              Scott                  19                                          4         1
AERO    101    01        William            Beauter                11                                          8    2    1
AERO    103    01        Sarah              Babbitt                15                                          7    6    1        1
AERO    411    01        Sarah              Babbitt                11                                          6    4    1
AERO    413    01        Sarah              Babbitt                12                                          8    3    1
AGEC   101   01   Larry         Van Tassell   36    1    20   15        1
AGEC   278   01   Larry         Makus         21    1    2    6    8    5
AGEC   278   02   Larry         Makus         18         5    10   2    1
AGEC   278   03   Larry         Makus         17    1    2    7    5    2    1
AGEC   301   01   Christopher   McIntosh      18         9    4    5
AGEC   356   01   Joseph        Guenthner     23         15   6    2
AGEC   361   01   Ruby          Stroschein    11         4    1    6
AGEC   411   01   Robert        Haggerty      11         6    4    1
AGEC   413   01   Robert        Spear         12    3    4    5    2    1
AGEC   415   01   Larry         Van Tassell   11         10   1
AGEC   526   01   Scott         Matulich      7          2    5
AGEC   527   01   Stephen       Cooke         5          3    2
AGED   180   01   Lori          Moore         23    1    14   5    1    3
AGED   351   01   Lou           Riesenberg    11         4    6    1
AMST   301   01   Walter        Hesford       26         14   8    3         1
ANTH   100   01   Mark          Warner        104   15   31   31   21   8    12
ANTH   220   01   Fumiyasu      Arakawa       138   4    48   53   19   10   8
ANTH   230   01   Robert        Sappington    28    1    7    9    9    2    1
ANTH   251   01   Donald        Tyler         36    1    10   14   8    1    3
ANTH   420   01   Laura         Putsche       12         3    4    2         2
ANTH   422   01   Rodney        Frey          13         11                  2
ANTH   427   02   Virginia      Babcock       13    1         2    6 4       1
ANTH   462   01   Laura         Putsche       33    3    8    20   3 1
ARBC   101   01   Anisah        El-Mansouri   14    1    8    5    1
ARCH   151   01   Randall       Teal          150   8    72   40   13 6      19
ARCH   253   01   Roman         Montoto       23    1    9    10   2         1
ARCH   253   02   Randall       Teal          22    2    9    11   2
ARCH   253   03   Xiao          Hu            23    2    11   12
ARCH   353   01   Matthew       Brehm         16         7    7    1
ARCH   353   02   Dillon        Ellefson      16         4    11   1
ARCH   353   03   Xiao          Hu            10         4    6
ARCH   385   01   Anne          Marshall      68    5    29   22   11 2      4
ARCH   404   04   Matthew       Brehm         10         1    5    3 1
ARCH   453   01   Roman         Montoto       10         5    4    1
ARCH   453   02   Anne       Marshall              13        6     5             1
ARCH   463   01   Phillip    Mead                  63    1   26    31   5 1
ARCH   465   01   Kenneth    Carper                51    1   8     26   12 3
ARCH   483   01   D.         Reese                 71    2   27    35   8
ARCH   504   02   Randall    Teal                  15        9     6
ARCH   504   03   Kevin      Van Den Wymelenberg   6         3     1             1
ARCH   504   04   Frank      Jacobus               12    1   8     4
ARCH   510   02   D.         Reese                 13        9     4
ARCH   510   04   Robert     Thornton              9         7     1
ARCH   510   05   Roman      Montoto               11    2   7     4
ARCH   553   01   Bruce      Haglund               14        12    2

I have this sub, which takes each line and is supposed to produce a list of its fields:

sub GetData {
    my $non_nor_line              = shift;
    my( $subj, $crs,$sec, $rest ) = unpack "a6 a6 a6 a*", $non_nor_line;
    my $name                      = undef;
    my $upk_short  = q{A3A2A3A2 A3A2 A3AA5 A6};

    $rest =~ m/(.+?)\d/;
    $name = $1;
    $rest =~ s/$1//;
    $rest =~ s/^\s+//;
    $rest =~ s/\s+$//;
    my @rest_data                 = unpack($upk_short,$rest);

    print $_ ."\n" foreach(@rest_data);
}

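I can't work out how to get the data out of $rest. I have tried many variations of unpack, to no avail, and I need to store the values in a list. Ignore 'upk_short'; it is not correct. I tried many other templates as well, but the lines seem too irregular.

Update: if anyone can find a way to normalize the text, that would be fine too. By that I mean aligning everything so that I can parse it Tom's way.

Any ideas?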

2 answers:

Answer 0 (score: 5)

#!/usr/bin/env perl

use strict;
use warnings;

sub cut2fmt {
    my @positions  = @_;
    my $template   = "";
    my $lastpos    = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);

my @keys = qw{

    subject                 course              section

    instructor_first_name   instructor_last_name

    completed_the_class     dropped_the_class

    grade_A                 grade_B
    grade_C                 grade_D
    grade_F

};

our @All_Records;

while (<DATA>) {
    next if 1 .. /^\s*\|/;
    my %rec;
    @rec{@keys} = unpack($fmt, $_);
    for my $key (grep { /^grade_[A-F]$/ } @keys) {
        $rec{$key} ||= 0;
    }
    push @All_Records, \%rec;
}

for my $rec (@All_Records) {
    for my $key (@keys) {
        print "$key: $rec->{$key}\n";
    }
    print "\n";

}

__END__
Subject Course Section                      Instructor Last Name                                               A    B C       D F
                         Name                                      Completed the Class   Dropped the Class
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
         1         2         3         4         5         6         7         8         9         0         1         2         3         4
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
        |      |         |                  |                      |                     |                     |    |    |    |   |
ACCT    201    01        Karin              Hatheway Dial          56                    6                     19   9    16   2   5
ACCT    202    01        Karin              Hatheway Dial          69                    11                    37   14   7    2   6
ACCT    205    01        Darryl             Woolley                20                    1                     3    7    6    1   3
ACCT    205    02        Darryl             Woolley                28                    1                     6    7    13       2
ACCT    205    03        Darryl             Woolley                42                    5                     4    13   21   1   3
ACCT    205    04        Darryl             Woolley                23                    1                     9    5    8    1
ACCT    205    05        Darryl             Woolley                30                    2                     11   7    9    2   1
ACCT    205    06        Darryl             Woolley                25                    3                     8    9    6    1   1
ACCT    275    01        Darryl             Woolley                33                    2                     7    15   9    1   1
ACCT    310    01        Marla              Kraut                  16                    1                     1    6    7    2
ACCT    310    02        Marla              Kraut                  64                                          5    43   15   1
ACCT    310    03        Marla              Kraut                  72                    3                     11   47   10   3   1
ACCT    311    01        Karin              Hatheway Dial          45                                          13   20   11   1
ACCT    311    02        Karin              Hatheway Dial          25                                          10   12   3
ACCT    315    01        Jason              Porter                 26                                          6    5    8    6   1
ACCT    315    02        Jason              Porter                 29                    1                     6    10   5    7   1
ACCT    414    01        Teresa             Gordon                 22                    1                     6    6    9    1
ACCT    483    01        Glen               Utzman                 26                    1                     7    13   6
ACCT    486    01        Teresa             Gordon                 33                                          13   14   6
ACCT    492    01        Jason              Wills                  23                                          5    8    9    1
ACCT    515    01        Jeffrey            Harkins                15                                          7    6    1
ACCT    561    01        Jason              Porter                 18                    1                     10   7    1
ADOL    526    13        Charles            Gagel                  21                    2                     19   1             1
ADOL    573    13        Martha             Yopp                   28                                          16   3             1
ADOL    574    01        Laura              Holyoke                16                                          12   3             1
ADOL    574    11        Laura              Holyoke                9                     1                     8    1
ADOL    574    13        Laura              Holyoke                15                                          10   4             1
ADOL    600    13        Roger              Scott                  19                                          4         1
AERO    101    01        William            Beauter                11                                          8    2    1
AERO    103    01        Sarah              Babbitt                15                                          7    6    1        1
AERO    411    01        Sarah              Babbitt                11                                          6    4    1
AERO    413    01        Sarah              Babbitt                12                                          8    3    1

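The first thing you need to do is normalize your data. Your columns are not consistent, and I can't tell you why that is. Perhaps your tabs need to be piped through expand -8 or something similar. I have included only the data that is aligned the same way.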
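If stray tabs really are the cause (an assumption; the answer itself is not sure), the core Text::Tabs module can do the job of expand -8 from inside Perl. A minimal sketch, where normalize_line is a hypothetical helper name rather than anything from the original post:

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

$tabstop = 8;      # same tab width as `expand -8`

# Hypothetical helper: strip the line ending, then replace each tab with
# enough spaces to reach the next multiple-of-8 column.
sub normalize_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return scalar expand($line);
}

print normalize_line("ACCT\t201\t01\n"), "\n";
```

This only helps if the misalignment comes from tabs. If the three layouts were produced with different column widths in spaces, expanding tabs will not bring them into line.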

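To get your unpack format right every time, all you need to do is draw a numbered ruler beneath the header, as I have done. Put a | mark at the position where each field starts. Note those column numbers and pass them to the cut2fmt() function included here. It converts the numbers into a pack/unpack template.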

That is all there is to it.
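The ruler step can also be automated. The sketch below reuses cut2fmt() from the answer and adds positions_from_ruler, a hypothetical helper (not part of the original answer) that reads the 1-based column of every | mark off a ruler line:

```perl
use strict;
use warnings;

# Same logic as cut2fmt() in the answer: turn 1-based field start
# columns into an unpack template.
sub cut2fmt {
    my @positions = @_;
    my $template  = "";
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    return $template . "A*";
}

# Hypothetical helper: collect the column of each '|' on the ruler line.
sub positions_from_ruler {
    my ($ruler) = @_;
    my @positions;
    while ($ruler =~ /\|/g) {
        push @positions, pos($ruler);    # pos() after matching '|' is its 1-based column
    }
    return @positions;
}

# The first three marks of the ruler used in the answer.
my $ruler = (" " x 8) . "|" . (" " x 6) . "|" . (" " x 9) . "|";
my @cols  = positions_from_ruler($ruler);

print "columns:  @cols\n";
print "template: ", cut2fmt(@cols), "\n";
```

For these three marks the columns are 9, 16 and 26, so the template comes out as A8 A7 A10 A*, which matches the start of the template produced by the full list of positions in the answer.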

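I would tell you where these little nuggets come from, but I can't stand heavy-handed self-promoters, so it would be hypocritical of me to stoop that low. I won't do it. If someone wants to advertise, let them buy ad space from the site; those of us who hate junk ads can block them with our ad blockers. Anything else just seems unseemly to me.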

Answer 1 (score: 2)

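Your data looks odd. Are there tabs in it? It looks like three sets of records, each with a different layout. Is that right?

If the data is in fixed positions, you should be able to unpack all 12 columns in one go. If there are three kinds of layout, I would use a regex to decide which layout applies to the current line, and then use the appropriate template for that set of records.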

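Because some of the 12 data columns may be blank, and some records have numbers in unusual positions, it may not be possible to assign every value to the correct column.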
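One way to catch such records is to check every count field after unpacking and flag anything that is neither empty nor a plain integer. This is only a sketch: suspicious_fields is a hypothetical helper, and the sample record is made up for illustration, mimicking a row where two grades ran together into one field (as with the "6 4" in the ANTH 427 row).

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the count fields whose value
# is neither empty nor made up of digits only.
sub suspicious_fields {
    my ($rec, @count_fields) = @_;
    return grep { defined $rec->{$_} && $rec->{$_} !~ /\A\d*\z/ } @count_fields;
}

my @counts = qw(Completed Dropped A B C D F);

# Illustrative record, not real program output.
my %rec = (
    Completed => '13', Dropped => '1',
    A => '', B => '2', C => '6 4', D => '', F => '1',
);

my @bad = suspicious_fields(\%rec, @counts);
print "check by hand: @bad\n" if @bad;
```

A check like this cannot tell which column a misplaced number belongs to. It only narrows down the rows that need to be fixed by hand or handled with a different template.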


Edit

#!/usr/bin/perl
use strict;
use warnings;

my @heading = qw(Subject Course Section Firstname Lastname
                 Completed Dropped A B C D F);

# Use position of Instructors Last Name as a guide to line layout.
my %template = (45 => "A8 A7 A10 A19 A23 A22 A22 A5 A5 A5 A4 A4",
                33 => "A7 A6 A5  A14 A14 A6  A5  A5 A5 A5 A5 A5",
                30 => "A7 A6 A5  A11 A22 A6  A4  A6 A5 A2 A2 A2");

while(<DATA>) {
  next unless /^[A-Z]{4} /;
  chomp;
  GetData($_);
}

sub GetData {
  my $line = shift;
  for my $lastname_position (keys %template) {
    if (substr($line, $lastname_position-2, 2) =~ / [A-Z]/) {
      my @values = unpack ($template{$lastname_position}, $line);
      my $column=0;
      for my $value(@values) {
        print "$heading[$column] = '$value'\n";
        $column++;
      }
      print "\n";
      last;
    }
  }
}

__DATA__
                         Instructor First                          Number Students Who   Number Students Who
Subject Course Section                      Instructor Last Name                                               A    B C       D F
                         Name                                      Completed the Class   Dropped the Class
ACCT    201    01        Karin              Hatheway Dial          56                    6                     19   9    16   2   5
ACCT    202    01        Karin              Hatheway Dial          69                    11                    37   14   7    2   6
ACCT    205    01        Darryl             Woolley                20                    1                     3    7    6    1   3
ACCT    205    02        Darryl             Woolley                28                    1                     6    7    13       2
ACCT    205    03        Darryl             Woolley                42                    5                     4    13   21   1   3
ACCT    205    04        Darryl             Woolley                23                    1                     9    5    8    1
ACCT    205    05        Darryl             Woolley                30                    2                     11   7    9    2   1
ACCT    205    06        Darryl             Woolley                25                    3                     8    9    6    1   1
ACCT    275    01        Darryl             Woolley                33                    2                     7    15   9    1   1
ACCT    310    01        Marla              Kraut                  16                    1                     1    6    7    2
ACCT    310    02        Marla              Kraut                  64                                          5    43   15   1
ACCT    310    03        Marla              Kraut                  72                    3                     11   47   10   3   1
ACCT    311    01        Karin              Hatheway Dial          45                                          13   20   11   1
ACCT    311    02        Karin              Hatheway Dial          25                                          10   12   3
ACCT    315    01        Jason              Porter                 26                                          6    5    8    6   1
ACCT    315    02        Jason              Porter                 29                    1                     6    10   5    7   1
ACCT    414    01        Teresa             Gordon                 22                    1                     6    6    9    1
ACCT    483    01        Glen               Utzman                 26                    1                     7    13   6
ACCT    486    01        Teresa             Gordon                 33                                          13   14   6
ACCT    492    01        Jason              Wills                  23                                          5    8    9    1
ACCT    515    01        Jeffrey            Harkins                15                                          7    6    1
ACCT    561    01        Jason              Porter                 18                    1                     10   7    1
ADOL    526    13        Charles            Gagel                  21                    2                     19   1             1
ADOL    573    13        Martha             Yopp                   28                                          16   3             1
ADOL    574    01        Laura              Holyoke                16                                          12   3             1
ADOL    574    11        Laura              Holyoke                9                     1                     8    1
ADOL    574    13        Laura              Holyoke                15                                          10   4             1
ADOL    600    13        Roger              Scott                  19                                          4         1
AERO    101    01        William            Beauter                11                                          8    2    1
AERO    103    01        Sarah              Babbitt                15                                          7    6    1        1
AERO    411    01        Sarah              Babbitt                11                                          6    4    1
AERO    413    01        Sarah              Babbitt                12                                          8    3    1
AGEC   101   01   Larry         Van Tassell   36    1    20   15        1
AGEC   278   01   Larry         Makus         21    1    2    6    8    5
AGEC   278   02   Larry         Makus         18         5    10   2    1
AGEC   278   03   Larry         Makus         17    1    2    7    5    2    1
AGEC   301   01   Christopher   McIntosh      18         9    4    5
AGEC   356   01   Joseph        Guenthner     23         15   6    2
AGEC   361   01   Ruby          Stroschein    11         4    1    6
AGEC   411   01   Robert        Haggerty      11         6    4    1
AGEC   413   01   Robert        Spear         12    3    4    5    2    1
AGEC   415   01   Larry         Van Tassell   11         10   1
AGEC   526   01   Scott         Matulich      7          2    5
AGEC   527   01   Stephen       Cooke         5          3    2
AGED   180   01   Lori          Moore         23    1    14   5    1    3
AGED   351   01   Lou           Riesenberg    11         4    6    1
AMST   301   01   Walter        Hesford       26         14   8    3         1
ANTH   100   01   Mark          Warner        104   15   31   31   21   8    12
ANTH   220   01   Fumiyasu      Arakawa       138   4    48   53   19   10   8
ANTH   230   01   Robert        Sappington    28    1    7    9    9    2    1
ANTH   251   01   Donald        Tyler         36    1    10   14   8    1    3
ANTH   420   01   Laura         Putsche       12         3    4    2         2
ANTH   422   01   Rodney        Frey          13         11                  2
ANTH   427   02   Virginia      Babcock       13    1         2    6 4       1
ANTH   462   01   Laura         Putsche       33    3    8    20   3 1
ARBC   101   01   Anisah        El-Mansouri   14    1    8    5    1
ARCH   151   01   Randall       Teal          150   8    72   40   13 6      19
ARCH   253   01   Roman         Montoto       23    1    9    10   2         1
ARCH   253   02   Randall       Teal          22    2    9    11   2
ARCH   253   03   Xiao          Hu            23    2    11   12
ARCH   353   01   Matthew       Brehm         16         7    7    1
ARCH   353   02   Dillon        Ellefson      16         4    11   1
ARCH   353   03   Xiao          Hu            10         4    6
ARCH   385   01   Anne          Marshall      68    5    29   22   11 2      4
ARCH   404   04   Matthew       Brehm         10         1    5    3 1
ARCH   453   01   Roman         Montoto       10         5    4    1
ARCH   453   02   Anne       Marshall              13        6     5             1
ARCH   463   01   Phillip    Mead                  63    1   26    31   5 1
ARCH   465   01   Kenneth    Carper                51    1   8     26   12 3
ARCH   483   01   D.         Reese                 71    2   27    35   8
ARCH   504   02   Randall    Teal                  15        9     6
ARCH   504   03   Kevin      Van Den Wymelenberg   6         3     1             1
ARCH   504   04   Frank      Jacobus               12    1   8     4
ARCH   510   02   D.         Reese                 13        9     4
ARCH   510   04   Robert     Thornton              9         7     1
ARCH   510   05   Roman      Montoto               11    2   7     4
ARCH   553   01   Bruce      Haglund               14        12    2

Output

Subject = 'ACCT'
Course = '201'
Section = '01'
Firstname = 'Karin'
Lastname = 'Hatheway Dial'
Completed = '56'
Dropped = '6'
A = '19'
B = '9'
C = '16'
D = '2'
F = '5'

...

Subject = 'AGEC'
Course = '101'
Section = '01'
Firstname = 'Larry'
Lastname = 'Van Tassell'
Completed = '36'
Dropped = '1'
A = '20'
B = '15'
C = ''
D = '1'
F = ''

...

Subject = 'ARCH'
Course = '553'
Section = '01'
Firstname = 'Bruce'
Lastname = 'Haglund'
Completed = '14'
Dropped = ''
A = '12'
B = '2'
C = ''
D = ''
F = ''

But the data really does need to be cleaner.