I have the following text:
Instructor First Number Students Who Number Students Who
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
AGEC 101 01 Larry Van Tassell 36 1 20 15 1
AGEC 278 01 Larry Makus 21 1 2 6 8 5
AGEC 278 02 Larry Makus 18 5 10 2 1
AGEC 278 03 Larry Makus 17 1 2 7 5 2 1
AGEC 301 01 Christopher McIntosh 18 9 4 5
AGEC 356 01 Joseph Guenthner 23 15 6 2
AGEC 361 01 Ruby Stroschein 11 4 1 6
AGEC 411 01 Robert Haggerty 11 6 4 1
AGEC 413 01 Robert Spear 12 3 4 5 2 1
AGEC 415 01 Larry Van Tassell 11 10 1
AGEC 526 01 Scott Matulich 7 2 5
AGEC 527 01 Stephen Cooke 5 3 2
AGED 180 01 Lori Moore 23 1 14 5 1 3
AGED 351 01 Lou Riesenberg 11 4 6 1
AMST 301 01 Walter Hesford 26 14 8 3 1
ANTH 100 01 Mark Warner 104 15 31 31 21 8 12
ANTH 220 01 Fumiyasu Arakawa 138 4 48 53 19 10 8
ANTH 230 01 Robert Sappington 28 1 7 9 9 2 1
ANTH 251 01 Donald Tyler 36 1 10 14 8 1 3
ANTH 420 01 Laura Putsche 12 3 4 2 2
ANTH 422 01 Rodney Frey 13 11 2
ANTH 427 02 Virginia Babcock 13 1 2 6 4 1
ANTH 462 01 Laura Putsche 33 3 8 20 3 1
ARBC 101 01 Anisah El-Mansouri 14 1 8 5 1
ARCH 151 01 Randall Teal 150 8 72 40 13 6 19
ARCH 253 01 Roman Montoto 23 1 9 10 2 1
ARCH 253 02 Randall Teal 22 2 9 11 2
ARCH 253 03 Xiao Hu 23 2 11 12
ARCH 353 01 Matthew Brehm 16 7 7 1
ARCH 353 02 Dillon Ellefson 16 4 11 1
ARCH 353 03 Xiao Hu 10 4 6
ARCH 385 01 Anne Marshall 68 5 29 22 11 2 4
ARCH 404 04 Matthew Brehm 10 1 5 3 1
ARCH 453 01 Roman Montoto 10 5 4 1
ARCH 453 02 Anne Marshall 13 6 5 1
ARCH 463 01 Phillip Mead 63 1 26 31 5 1
ARCH 465 01 Kenneth Carper 51 1 8 26 12 3
ARCH 483 01 D. Reese 71 2 27 35 8
ARCH 504 02 Randall Teal 15 9 6
ARCH 504 03 Kevin Van Den Wymelenberg 6 3 1 1
ARCH 504 04 Frank Jacobus 12 1 8 4
ARCH 510 02 D. Reese 13 9 4
ARCH 510 04 Robert Thornton 9 7 1
ARCH 510 05 Roman Montoto 11 2 7 4
ARCH 553 01 Bruce Haglund 14 12 2
I have this code (a sub) that takes each line and is supposed to produce a list of its fields:
sub GetData {
    my $non_nor_line = shift;
    my( $subj, $crs,$sec, $rest ) = unpack "a6 a6 a6 a*", $non_nor_line;
    my $name = undef;
    my $upk_short = q{A3A2A3A2 A3A2 A3AA5 A6};
    $rest =~ m/(.+?)\d/;
    $name = $1;
    $rest =~ s/$1//;
    $rest =~ s/^\s+//;
    $rest =~ s/\s+$//;
    my @rest_data = unpack($upk_short,$rest);
    print $_ ."\n" foreach(@rest_data);
}
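I don't know how to get the data out of $rest. I have tried many variations of unpack, to no avail, and I need to store the values in a list. Ignore 'upk_short'; it is not correct. I tried many other templates, but the lines seem to be too dynamic.
Update: if anyone can find a way to normalize the text, that would be fine too. By that I mean aligning everything so that I can parse it Tom's way.
Any ideas?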
Answer 0 (score: 5)
#!/usr/bin/env perl
use strict;
use warnings;

sub cut2fmt {
    my @positions = @_;
    my $template = "";
    my $lastpos = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);
my @keys = qw{
    subject course section
    instructor_first_name instructor_last_name
    completed_the_class dropped_the_class
    grade_A grade_B
    grade_C grade_D
    grade_F
};
our @All_Records;

while (<DATA>) {
    next if 1 .. /^\s*\|/;
    my %rec;
    @rec{@keys} = unpack($fmt, $_);
    for my $key (grep { /^grade_[A-F]$/ } @keys) {
        $rec{$key} ||= 0;
    }
    push @All_Records, \%rec;
}

for my $rec (@All_Records) {
    for my $key (@keys) {
        print "$key: $rec->{$key}\n";
    }
    print "\n";
}
__END__
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
1 2 3 4 5 6 7 8 9 0 1 2 3 4
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
| | | | | | | | | | |
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
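The first thing you need to do is normalize your data. Your columns are inconsistent, and I can't tell you why. Perhaps your tabs need to be piped through expand -8 or something similar. I have included only the data that is all aligned the same way.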
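If the misalignment really does come from tab characters, the expansion can also be done inside Perl instead of piping through expand. This is a minimal sketch, assuming 8-column tab stops and using the core Text::Tabs module; the sample line is made up for illustration and is not taken from the real file:

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

$tabstop = 8;      # assumed tab width, matching `expand -8`

# Hypothetical tab-separated line, for illustration only.
my $raw   = "ACCT\t201\t01\tKarin";
my $fixed = expand($raw);    # tabs replaced by spaces up to each tab stop

print "$fixed\n";
```

Once every line has been expanded this way, the columns should start at the same offsets on each row, which is what a fixed-width unpack template needs.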
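To get your unpack format right every time, all you have to do is draw a numbered ruler like the one I put under the header. Put a | mark at each position where a field starts. Note those numbers and pass them to the included cut2fmt() function. It converts the numbers into a pack/unpack template.
That's all there is to it.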
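As a small illustration of what cut2fmt() produces, here is a sketch that uses only the first three column positions from the call above. The sample line is built with sprintf so that its alignment is known; it is an assumption for the demo, not the real input.

```perl
use strict;
use warnings;

# Same helper as in the script above: turn 1-based column start
# positions into an unpack template of space-padded (A) fields.
sub cut2fmt {
    my @positions = @_;
    my $template  = "";
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26);    # fields start at columns 1, 9, 16 and 26
print "$fmt\n";                  # prints "A8 A7 A10 A*"

# Hypothetical, deliberately aligned sample line.
my $line   = sprintf "%-8s%-7s%-10s%s", "ACCT", "201", "01", "Karin";
my @fields = unpack $fmt, $line;
print join("|", @fields), "\n";  # prints "ACCT|201|01|Karin"
```

Each number passed in is the column where a new field begins, so the width of a field is simply the difference between two consecutive positions, and the final A* picks up whatever is left on the line.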
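I would tell you where these little pieces come from, but I just hate shameless self-promoters, so it would be hypocritical of me to stoop that low. I won't do it. If somebody wants to advertise, well, let them buy ad space from the site. Those of us who hate spam ads probably block them with our ad blockers anyway. Otherwise it just feels unseemly to me.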
Answer 1 (score: 2)
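Your data looks odd; are there tabs in it? It looks like three groups of records, each with a different layout. Is that right?
If the data is in fixed positions, you should be able to unpack all 12 columns in one go. If there are three types of layout, I would use a regular expression to decide which layout applies to the current line, and then use the appropriate template for that group of records.
Because some of the 12 data columns may be blank, and some records have numbers in unusual positions, it may not be possible to assign every value to the correct column.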
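The trouble with blank columns can be seen in a tiny sketch. The row below is invented, with five 4-character grade columns. Splitting on whitespace loses track of which column each value came from, while a fixed-width unpack keeps the empty slots, provided the alignment is reliable.

```perl
use strict;
use warnings;

# Hypothetical row: five grade columns (A B C D F), 4 characters each,
# with B and D left blank.
my $row = sprintf "%-4s%-4s%-4s%-4s%-4s", 12, "", 2, "", 1;

# Whitespace splitting collapses the blanks: 3 values, columns unknown.
my @by_split = split ' ', $row;

# Fixed-width unpack keeps all 5 slots; blanks come back as ''.
my @by_unpack = unpack "A4 A4 A4 A4 A4", $row;

print scalar(@by_split),  " fields from split\n";
print scalar(@by_unpack), " fields from unpack\n";

# Show blanks as 0 so the column positions are visible.
print join(",", map { length($_) ? $_ : 0 } @by_unpack), "\n";
```

This is why a row whose numbers have drifted out of their columns cannot be repaired after the fact: once the position is gone, nothing in the value itself says whether it was a B or a C.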
Edit
#!/usr/bin/perl
use strict;
use warnings;

my @heading = qw(Subject Course Section Firstname Lastname
                 Completed Dropped A B C D F);

# Use position of Instructors Last Name as a guide to line layout.
my %template = (45 => "A8 A7 A10 A19 A23 A22 A22 A5 A5 A5 A4 A4",
                33 => "A7 A6 A5 A14 A14 A6 A5 A5 A5 A5 A5 A5",
                30 => "A7 A6 A5 A11 A22 A6 A4 A6 A5 A2 A2 A2");

while(<DATA>) {
    next unless /^[A-Z]{4} /;
    chomp;
    GetData($_);
}

sub GetData {
    my $line = shift;
    for my $lastname_position (keys %template) {
        if (substr($line, $lastname_position-2, 2) =~ / [A-Z]/) {
            my @values = unpack ($template{$lastname_position}, $line);
            my $column=0;
            for my $value(@values) {
                print "$heading[$column] = '$value'\n";
                $column++;
            }
            print "\n";
            last;
        }
    }
}
__DATA__
Instructor First Number Students Who Number Students Who
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
AGEC 101 01 Larry Van Tassell 36 1 20 15 1
AGEC 278 01 Larry Makus 21 1 2 6 8 5
AGEC 278 02 Larry Makus 18 5 10 2 1
AGEC 278 03 Larry Makus 17 1 2 7 5 2 1
AGEC 301 01 Christopher McIntosh 18 9 4 5
AGEC 356 01 Joseph Guenthner 23 15 6 2
AGEC 361 01 Ruby Stroschein 11 4 1 6
AGEC 411 01 Robert Haggerty 11 6 4 1
AGEC 413 01 Robert Spear 12 3 4 5 2 1
AGEC 415 01 Larry Van Tassell 11 10 1
AGEC 526 01 Scott Matulich 7 2 5
AGEC 527 01 Stephen Cooke 5 3 2
AGED 180 01 Lori Moore 23 1 14 5 1 3
AGED 351 01 Lou Riesenberg 11 4 6 1
AMST 301 01 Walter Hesford 26 14 8 3 1
ANTH 100 01 Mark Warner 104 15 31 31 21 8 12
ANTH 220 01 Fumiyasu Arakawa 138 4 48 53 19 10 8
ANTH 230 01 Robert Sappington 28 1 7 9 9 2 1
ANTH 251 01 Donald Tyler 36 1 10 14 8 1 3
ANTH 420 01 Laura Putsche 12 3 4 2 2
ANTH 422 01 Rodney Frey 13 11 2
ANTH 427 02 Virginia Babcock 13 1 2 6 4 1
ANTH 462 01 Laura Putsche 33 3 8 20 3 1
ARBC 101 01 Anisah El-Mansouri 14 1 8 5 1
ARCH 151 01 Randall Teal 150 8 72 40 13 6 19
ARCH 253 01 Roman Montoto 23 1 9 10 2 1
ARCH 253 02 Randall Teal 22 2 9 11 2
ARCH 253 03 Xiao Hu 23 2 11 12
ARCH 353 01 Matthew Brehm 16 7 7 1
ARCH 353 02 Dillon Ellefson 16 4 11 1
ARCH 353 03 Xiao Hu 10 4 6
ARCH 385 01 Anne Marshall 68 5 29 22 11 2 4
ARCH 404 04 Matthew Brehm 10 1 5 3 1
ARCH 453 01 Roman Montoto 10 5 4 1
ARCH 453 02 Anne Marshall 13 6 5 1
ARCH 463 01 Phillip Mead 63 1 26 31 5 1
ARCH 465 01 Kenneth Carper 51 1 8 26 12 3
ARCH 483 01 D. Reese 71 2 27 35 8
ARCH 504 02 Randall Teal 15 9 6
ARCH 504 03 Kevin Van Den Wymelenberg 6 3 1 1
ARCH 504 04 Frank Jacobus 12 1 8 4
ARCH 510 02 D. Reese 13 9 4
ARCH 510 04 Robert Thornton 9 7 1
ARCH 510 05 Roman Montoto 11 2 7 4
ARCH 553 01 Bruce Haglund 14 12 2
Output
Subject = 'ACCT'
Course = '201'
Section = '01'
Firstname = 'Karin'
Lastname = 'Hatheway Dial'
Completed = '56'
Dropped = '6'
A = '19'
B = '9'
C = '16'
D = '2'
F = '5'
...
Subject = 'AGEC'
Course = '101'
Section = '01'
Firstname = 'Larry'
Lastname = 'Van Tassell'
Completed = '36'
Dropped = '1'
A = '20'
B = '15'
C = ''
D = '1'
F = ''
...
Subject = 'ARCH'
Course = '553'
Section = '01'
Firstname = 'Bruce'
Lastname = 'Haglund'
Completed = '14'
Dropped = ''
A = '12'
B = '2'
C = ''
D = ''
F = ''
But the data really does need to be cleaner.