I have the following regular expression:
my $scores_compiled_regex = qr{^0
\s+
(\p{Alpha}+\d*)
\s+
(\d+
\s*
\p{Alpha}*)
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s{2,}
(\d+)?
\s+
\d+ #$
}xos;
It should match lines like these (from a plain txt file):
0 AAS 211 1 1 5 2 6 15
The column names are:
0 INST, NAME A A- B+ B B- C+ C C- D+ D D- F CR P PR I I* W WP WF AU NR FN FS
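This means: grade A = 1, grade A- = 1, no B+ grades, grade B = 5, and so on. I am trying to split each line into a list without dropping the empty columns. It works, but it is very slow, and the matching is slow too. And I mean slow: more than 5 seconds, sometimes even longer!
The first few lines of the file look like this: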
0 PALMER, JAN A A- B+ B B- C+ C C- D+ D D- F CR P PR I I* W WP WF AU NR FN FS TOTAL
0 ECON 103 98 35 114 1 14 75 9 35 1 10 1
The grades are everything to the right, starting from the A column.
Any ideas? Thanks,
Answer 0 (score: 5)
Take a look at my program:
use strict;
use warnings;
# Column details and sample line, from the post
my $header = q{0 AOZSVIN, TAMSSZ B A A- B+ B B- C+ C C- D+ D D- F CR P PR I I* W WP WF AU NR FN FS};
my $sample = q{0 AAS 150 23 25 16 35 45 14 8 10 2 1 1 4 4 };
# -+--------+-----+-----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---..
# chars 1212345678912345612345612341234123412341234123412341234123412341234123412341234123412341234123412341234123412341234...
# num. chars: 2 9 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 *
my $unpack = q{A2A9 A6 A6 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A4 A*};
$unpack =~ s/\s//g;
# Get column names from the "$header" variable above
my @column_names = unpack($unpack, $header);
s/\s+$// for @column_names; # get rid of trailing spaces
s/^\s+// for @column_names; # get rid of leading spaces
# Some sample data in same format, to try the script out
my @samples = (
q{0 AAS 150 23 25 16 35 45 14 8 10 2 1 1 4 4 },
q{0 AAS 353 2 3 5 2 6 1 2 },
q{0 T304 480M 3 10 8 8 2 3 2 1 1 1 },
q{0 BIOS 206 3 14 5 11 9 8 4 8 3 1 1 6 7 },
);
my @big_sample = (@samples) ;#x 200_000;
my @unpacked_data_as_arrayrefs;
my @unpacked_data_as_hashrefs;
my $begin = time;
for my $line ( @big_sample ) {
my @data = unpack($unpack,$line);
s/\s+$// for @data; # get rid of trailing spaces
s/^\s+// for @data; # get rid of leading spaces
push @unpacked_data_as_arrayrefs, [@data]; # stop here if this is all you need
## below converts the data in a hash, based on the column names given
#my %as_hash;
#for ( 0..$#column_names ) {
# $as_hash{ $column_names[$_] } = $data[$_];
#}
#push @unpacked_data_as_hashrefs, { %as_hash };
}
my $tot = time - $begin;
print "Done in $tot seconds\n";
# verify all data is as we expected
# uncomment the ones that test hashref, if the above hashref-building code is also uncommented.
{
use Test::More;
# first sample
is($unpacked_data_as_arrayrefs[0]->[2],'AAS'); # AAS in the third column
is($unpacked_data_as_arrayrefs[0]->[7],'35'); # 35 in the 8th column
# fourth sample
is($unpacked_data_as_arrayrefs[3]->[2],'BIOS');
is($unpacked_data_as_arrayrefs[3]->[15],'6');
# sixth
is($unpacked_data_as_arrayrefs[5]->[7],'114');
is($unpacked_data_as_arrayrefs[5]->[10],'75');
done_testing();
}
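It uses unpack to split the text into chunks based on the width (in characters) of the fields in the string. See also perlpacktut for more details on how to use unpack for this kind of string munging. unpack is probably the best fit for this format, because it is remarkably fast compared with the regex (on my machine it parses 600_000 such strings in about 6 seconds).
Let me know if you need help understanding any part of the program. I have not posted a walkthrough here because the answer is already fairly long (better commented than not!). Tell me if you would like one.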
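As a small illustration of how a width-based template cuts a string, here is a minimal sketch. The two-column layout (a 6-character name followed by a 4-character count) is an assumption made up for this example, not the layout of the real report:

```perl
use strict;
use warnings;

# Hypothetical two-column layout: a 6-character name, then a 4-character count.
my $line = sprintf '%-6s%4s', 'AAS', 23;

# "A6 A4" cuts the string after 6 and then 4 characters. The A format
# strips trailing spaces from each chunk but keeps leading ones.
my ($name, $count) = unpack 'A6 A4', $line;

print "[$name] [$count]\n";
```

The name comes back without its trailing padding, while the right-aligned count still carries its leading spaces, which is why the program above trims both ends after unpacking.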
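Answer 1 (score: 4)
If the format you have to accept is as loose as what your regex currently accepts, you have a big problem: if one or more numeric fields are missing and the line contains several runs of 4 or more consecutive spaces, it is ambiguous which score belongs to which column.
Perl's backtracking will resolve the ambiguity by choosing the "leftmost, longest" match, but (a) that is not necessarily what you want, and (b) the number of possibilities it has to try is exponential in the number of numeric fields missing from the line, which is why it is slow.
To illustrate, let's use a simpler regex: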
/\A(\d+)?\s{2,}
(\d+)?\s{2,}
(\d+)?\s{2,}
(\d+)?\z/xs;
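Suppose the input is:
123    456    789
(There are four spaces between each number.) Now, should 456 be returned as the second field or the third? Both are valid matches. In this case Perl's backtracking will make it the second field, but I doubt you really want to rely on Perl's backtracking to decide this.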
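The ambiguity can be checked directly. This is a minimal sketch that assumes the input has exactly four spaces between the numbers; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Same simplified pattern as above: four optional numeric fields
# separated by runs of two or more whitespace characters.
my $re = qr/\A(\d+)?\s{2,}
              (\d+)?\s{2,}
              (\d+)?\s{2,}
              (\d+)?\z/xs;

# Assumed input: three numbers with exactly four spaces between them.
my $input = join ' ' x 4, 123, 456, 789;

# In list context a successful match returns all four captures.
my @fields = $input =~ $re;

# Print each capture; an optional group that did not match is undef.
for my $i (0 .. $#fields) {
    printf "field %d: %s\n", $i + 1,
        defined $fields[$i] ? $fields[$i] : '(undef)';
}
```

Here 456 lands in the second field and the third capture is undefined, even though a reading with 456 in the third field would satisfy the pattern just as well.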
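Suggestion: if possible, replace each \s{2,} with a regex that matches a fixed number of spaces. If the only reason you allow a variable number of spaces is that the numbers are lined up in columns and may be 1 or 2 digits long, then simply use substr() to grab the values from known column offsets instead of using a regex. (Fixed-width data cannot be parsed efficiently with a regex.)
Answer 2 (score: 3)
If columns can be empty, then either (a) your data is ambiguous and you have a bigger problem than a slow regex, or (b) your data is in a fixed-width format, like this: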
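The substr() approach could look roughly like the sketch below. The column names, offsets and widths are hypothetical placeholders; the real values have to be measured from the actual report:

```perl
use strict;
use warnings;

# Hypothetical layout: [ key, offset, width ] for each column.
my @columns = (
    [ name => 0,  6 ],
    [ a    => 6,  4 ],
    [ a_m  => 10, 4 ],
    [ b_p  => 14, 4 ],
);

# Sample line built to follow the assumed layout; the B+ column is empty.
my $line = sprintf '%-6s%4s%4s%4s', 'AAS', 23, 25, '';

my %row;
for my $col (@columns) {
    my ($key, $offset, $width) = @$col;
    # Pad short lines so substr never reads past the end of the string.
    my $padded = sprintf '%-*s', $offset + $width, $line;
    my $value  = substr $padded, $offset, $width;
    $value =~ s/^\s+|\s+$//g;    # trim the padding around the value
    $row{$key} = $value;
}
```

Because every value is read from its own offset, an empty column simply yields an empty string and cannot shift the columns that follow it.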
NAME A A-
foo 123 456
bar 789
fubb 111
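Answer 3 (score: 3)
Don't use a regex. This looks like a fixed-column format, so unpack will be faster.
Here is a sample program that works with the data shown in the question. You still have to figure out how to integrate it so that you know when a new person's record starts, and so on. I wrote it so that the format for unpacking the values is derived mostly from the header, so you don't have to spend much time counting columns (and it also adapts easily if the column positions change):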
chomp( my $header = <DATA> );
my( $num, $name, $rest ) = unpack "a2 a20 a*", $header;
my @grades = split /(?=\s+)/, $rest;
my @grade_keys = map { /(\S+)/} @grades;
my $format = 'a13 a4 a5 ' . join ' ', map { 'a' . length } @grades;
my %hash; # nested results: $hash{$key}{$label}{$number}{grade name}
while( <DATA> ) {
my( $key, $label, $number, @grades ) = unpack $format, $_;
$$_ =~ s/\s//g foreach ( \$key, \$label, \$number );
@{ $hash{$key}{$label}{$number} }{@grade_keys} =
map { s/\s//g; $_ } @grades;
}
use Data::Dumper;
print Dumper( \%hash );
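You said you ran into trouble because some columns hold three-digit values. This code should work unless those values throw off the grid, that is, unless the least significant digit no longer lines up with the last non-blank character of its column.
Here is the data structure I produced for "AOZSVIN, TAMSSZ B" (whose sample data is now hidden in the edit history of your question), although you can arrange it however you like: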
$VAR1 = {
'0' => {
'BIOS' => {
'206' => {
'F' => '6',
'AU' => '',
'FS' => '',
'B-' => '9',
'D+' => '3',
'CR' => '',
'B+' => '5',
'WP' => '7',
'C+' => '8',
'NR' => '',
'C' => '4',
'PR' => '',
'A' => '3',
'W' => '',
'I*' => '',
'A-' => '14',
'P' => '',
'WF' => '',
'B' => '11',
'FN' => '',
'D' => '1',
'D-' => '1',
'I' => '',
'C-' => '8'
}
},
'AAS' => {
'353' => {
'F' => '2',
'AU' => '',
'FS' => '',
'B-' => '6',
'D+' => '',
'CR' => '',
'B+' => '5',
'WP' => '',
'C+' => '',
'NR' => '',
'C' => '1',
'PR' => '',
'A' => '2',
'W' => '',
'I*' => '',
'A-' => '3',
'P' => '',
'WF' => '',
'B' => '2',
'FN' => '',
'D' => '',
'D-' => '',
'I' => '',
'C-' => ''
},
'150' => {
'F' => '4',
'AU' => '',
'FS' => '',
'B-' => '45',
'D+' => '2',
'CR' => '',
'B+' => '16',
'WP' => '4',
'C+' => '14',
'NR' => '',
'C' => '8',
'PR' => '',
'A' => '23',
'W' => '',
'I*' => '',
'A-' => '25',
'P' => '',
'WF' => '',
'B' => '35',
'FN' => '',
'D' => '1',
'D-' => '1',
'I' => '',
'C-' => '10'
}
},
'T304' => {
'480M' => {
'F' => '',
'AU' => '',
'FS' => '1',
'B-' => '2',
'D+' => '',
'CR' => '',
'B+' => '8',
'WP' => '',
'C+' => '3',
'NR' => '',
'C' => '2',
'PR' => '',
'A' => '3',
'W' => '',
'I*' => '',
'A-' => '10',
'P' => '',
'WF' => '1',
'B' => '8',
'FN' => '',
'D' => '',
'D-' => '',
'I' => '',
'C-' => '1'
}
}
}
};
And for the new sample for "Palmer, Jan":
$VAR1 = {
'0' => {
'ECON' => {
'103' => {
'F' => '35',
'AU' => '1',
'FS' => '',
'B-' => '1',
'D+' => '',
'CR' => '',
'B+' => '35',
'WP' => '10',
'C+' => '14',
'NR' => '',
'C' => '75',
'PR' => '',
'A' => '98',
'W' => '',
'I*' => '',
'A-' => '',
'P' => '',
'WF' => '',
'B' => '114',
'FN' => '',
'TOTAL' => '',
'D' => '9',
'D-' => '',
'I' => '1',
'C-' => ''
}
}
}
};
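Answer 4 (score: -1)
First split the line into fixed-width chunks, whitespace and all. Then clean up the chunks. Otherwise you are trying to do two things at once, which can be error-prone.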
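The two steps can be kept apart as in this minimal sketch. The template and the sample line are assumptions made up for the example, not the real column widths:

```perl
use strict;
use warnings;

# Hypothetical template: a 6-character name followed by three 4-character
# columns. The "a" format keeps all whitespace, so step 1 cleans nothing.
my $template = 'a6 a4 a4 a4';

# Sample line built to match the assumed widths; the middle column is empty.
my $line = sprintf '%-6s%4s%4s%4s', 'BIOS', 3, '', 14;

# Step 1: cut the line into fixed-width chunks, whitespace and all.
my @chunks = unpack $template, $line;

# Step 2: clean each chunk on a copy, leaving the raw chunks untouched.
my @fields = map { my $f = $_; $f =~ s/^\s+|\s+$//g; $f } @chunks;
```

Keeping the raw chunks around makes it easy to check the column boundaries on their own before any trimming hides an alignment mistake.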