I'm working with some .doc files which, when copied and pasted into a text file, give me the following sample 'output':
ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required

ARTA220 CERAMICS II (3 Cr) (2:2) + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required

ARTA250 SPECIAL TOPICS IN ART
This course focuses on selected topic....

ARTA260 PORTFOLIO DEVELOPMENT (3 Cr) (3:0)
The purpose of this course is to pre....

BIOS010 INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)
This course is a preparatory course designed to familiarize the begi....

BIOS101 GENERAL BIOLOGY (4 Cr) (3:3)
This course introduces the student to the principles of mo...
Lab Fee Required

BIOS102 INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)
This course is an introd....
Lab Fee Required
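I want to be able to parse this into 3 fields whose values I can then output to a .csv file.
Line breaks, spacing, etc. are all over the place in this file.
My best guess is a regex that finds 4 uppercase letters followed by 3 digits, then checks whether the next 2 characters are uppercase. (This identifies the course #, and also rules out tripping over a place such as the first entry, where a course # appears after "Prerequisite".) After that, the regex finds the first line break and grabs everything after it until it reaches the next course #. The 3 fields would be course number, course title, and course description. The course number and title are always on the same line, and the description is everything beneath.
A sample end result would contain 3 fields, which I'm guessing could be stored in 3 arrays: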
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
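Like I said, this is quite a nightmare, but I'd like to automate the job rather than have someone clean up by hand every time the file is generated.
Answer 0: (score: 11)
Consider the following example, which relies on each course's block being entirely contained in what Perl considers a paragraph: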
#! /usr/bin/perl
use strict;
use warnings;

$/ = "";    # paragraph mode: read one blank-line-separated block at a time

my $record_start = qr/
    ^               # at the start of a line
    \s*             # allow optional leading whitespace
    ([A-Z]+\d+)     # capture course tag, e.g., ARTA215
    \s+             # separating whitespace
    (.+?)           # course title on rest of line
    \s*\n           # consume trailing whitespace
/mx;

while (<>) {
    my ($course, $title);
    if (s/\A$record_start//) {
        # the block begins with a course header
        ($course, $title) = ($1, $2);
    }
    elsif (s/(?s:^.+?)(?=$record_start)//) {
        # discard anything before the next course header, then retry
        redo;
    }
    else {
        next;
    }

    # the description runs up to the next header or the end of the block
    die unless s/^(.+?)(?=$record_start|\s*$)//s;
    (my $desc = $1) =~ s/\s*\n\s*/ /g;

    for ($course, $title, $desc) {
        s/^\s+//; s/\s+$//; s/\s+/ /g;
    }

    print join("," => map qq{"$_"} => $course, $title, $desc), "\n";
    redo if $_;    # more records may remain in this block
}
Given your sample input, it outputs
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"
Answer 1: (score: 7)
Try:
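my $course;
my @courses;
# $input_handle is assumed to be a filehandle already opened on the text file
while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        # header line: start a new [number, title] record
        $course = [ "$1", "$2" ];
        push @courses, $course;
    }
    elsif ($course) {
        # any other line belongs to the current course's description
        $course->[2] .= $line;
    }
    else {
        # garbage before first course in file
        next;
    }
}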
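This yields an array of arrays, which, as far as I can tell, is what you want. An array of hashes, or even a hash of hashes, would make more sense to me.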
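For illustration, here is a minimal sketch of the array-of-hashes variant. It reuses the same header regex; the key names (id, title, description) and the in-memory sample lines are assumptions made for this example, not something taken from the question.

```perl
use strict;
use warnings;

# Sample lines standing in for the real file contents.
my @lines = (
    "ARTA250 SPECIAL TOPICS IN ART\n",
    "This course focuses on selected topic....\n",
    "ARTA260 PORTFOLIO DEVELOPMENT (3 Cr) (3:0)\n",
    "The purpose of this course is to pre....\n",
);

my @courses;    # array of hashes, one hash per course
for my $line (@lines) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        # header line: start a new course record
        push @courses, { id => $1, title => $2, description => '' };
    }
    elsif (@courses) {
        # any other line is appended to the most recent course
        $courses[-1]{description} .= $line;
    }
}

for my $course (@courses) {
    $course->{description} =~ s/\s+/ /g;     # collapse line breaks and runs of spaces
    $course->{description} =~ s/^ | $//g;    # trim both ends
    print "$course->{id}: $course->{title}\n";
}
```

Named keys make later code such as $course->{title} easier to read than positional indexes like $course->[1].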
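Answer 2: (score: 4)
I had roughly the same idea as gbacon: use paragraph mode, since that neatly chunks the file into your records. He typed faster, but I had already written mine, so here's my stab at it: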
#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";    # paragraph mode

my @items;
while (<>) {
    # the first line is the course header; the remaining lines form the
    # description (including any Prerequisite and Lab Fee lines)
    my ( $course, @rest ) = split /\n/, $_;
    next unless $course =~ m/^(\w+)\s+(.*)$/;    # skip blocks without a header
    my ( $course_id, $name ) = ( $1, $2 );
    push @items, [ $course_id, $name, join( ' ', @rest ) ];
}

for my $record (@items) {
    print "Course id: ",        $record->[0], "\n";
    print "Name and credits: ", $record->[1], "\n";
    print "Description: ",      $record->[2], "\n";
}
As ysth points out in a comment on gbacon's answer, paragraph mode may not work here. If it doesn't, never mind.
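If the records turn out not to be separated by blank lines, one alternative is to read the whole text and split in front of each course tag, much as the question proposes. A rough sketch follows, assuming every tag is four capitals plus three digits followed by a title that starts with two capitals. The sample text and variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample text with no blank lines between records.
my $text = <<'END';
ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required
ARTA220 CERAMICS II (3 Cr) (2:2) + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required
END

# Split in front of every line that starts with a course tag followed by
# two capitals; the lookahead keeps the tag as part of its record.
my @records = split /^(?=[A-Z]{4}\d{3}\s+[A-Z]{2})/m, $text;

my @rows;
for my $record (@records) {
    my ( $header, @rest ) = split /\n/, $record;
    next unless defined $header
        && $header =~ /^([A-Z]{4}\d{3})\s+(.*?)\s*$/;
    my ( $id, $title ) = ( $1, $2 );
    push @rows, [ $id, $title, join( ' ', @rest ) ];
}

print join( ' | ', @$_ ), "\n" for @rows;
```

Because the tag must sit at the start of a line, the ARTA150 after "Prerequisite:" does not start a new record.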
Answer 3: (score: 0)
A regex may be overkill, since the pattern looks fairly simple:
[course]
[description]
{Prerequisites}
{Lab Fee Required}
where [course] consists of
[course#] [course title] {# Cr} [etc/don't care]
and the course # is simply the first 7 characters.
So you can scan the file with a simple state machine, for example:
//NOTE: THIS IS PSEUDOCODE
s = 'parseCourse'
f = openFile(blah)
l = readLine(f)
while (l) {
    if (s=='parseCourse') {
        if (l.StartsWith('Prerequisite:')) {
            extractPrerequisite(l)
        }
        else if (l.StartsWith('Lab Fee Required')) {
            extractLabFeeRequired(l)
        }
        else {
            extractCourseInfo(l)
            s='parseDescription'
        }
    }
    else if (s=='parseDescription') {
        extractDescription(l)
        s='parseCourse'
    }
    l = readLine(f)
}
close(f)
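The pseudocode above could be sketched in Perl roughly as follows. This is only one possible rendering: the hash keys, the in-memory sample lines, and the decision to skip blank lines are assumptions, and like the pseudocode it expects exactly one description line per course.

```perl
use strict;
use warnings;

# Sample lines standing in for the file; a real script would read them from a handle.
my @lines = (
    'ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.',
    'This advanced study in drawing with the life ....',
    'Prerequisite: ARTA150',
    'Lab Fee Required',
    '',
    'ARTA250 SPECIAL TOPICS IN ART',
    'This course focuses on selected topic....',
);

my $state = 'parseCourse';
my @courses;

for my $line (@lines) {
    next unless $line =~ /\S/;    # skip blank lines
    if ( $state eq 'parseCourse' ) {
        if ( $line =~ /^Prerequisite:\s*(.*)/ ) {
            $courses[-1]{prerequisite} = $1 if @courses;
        }
        elsif ( $line =~ /^Lab Fee Required/ ) {
            $courses[-1]{lab_fee} = 1 if @courses;
        }
        else {
            # the course # is the first 7 characters; the title is the rest
            my $title = substr( $line, 7 );
            $title =~ s/^\s+//;
            push @courses, { id => substr( $line, 0, 7 ), title => $title };
            $state = 'parseDescription';
        }
    }
    else {    # parseDescription: take one line, then expect a course again
        $courses[-1]{description} = $line;
        $state = 'parseCourse';
    }
}

for my $c (@courses) {
    printf qq{"%s","%s","%s"\n}, $c->{id}, $c->{title}, $c->{description};
}
```

Keeping the prerequisite and lab fee as separate keys leaves it open whether they are later folded back into the description field, as the question's sample output does.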
Answer 4: (score: 0)
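#!/usr/bin/perl
use strict;
use warnings;

$/ = "\n\n";    # records are separated by a single blank line

while (<>) {
    chomp;                        # strip the trailing record separator
    my @fields = split /\n/;      # one field per line of the record
    print join(',', @fields), "\n";
}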