Question

I have the following regular expression:

my $scores_compiled_regex  = qr{^0
                                  \s+
                                  (\p{Alpha}+\d*)
                                  \s+
                                  (\d+
                                  \s*
                                   \p{Alpha}*)
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}                              
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s+
                                   \d+ #$
                                   }xos

It should match lines like this one (from a plain text file):

0            AAS  211    1   1       5       2   6                                                                         15

The column names are:

0 INST, NAME             A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS

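That means: grade A = 1, grade A- = 1, no grade B+, grade B = 5, and so on. I am trying to split the line into a list without dropping the empty columns. It works, but it is very slow. The matching is slow too, and by slow I mean more than 5 seconds per match, sometimes even longer!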

The first few lines of the file look like this:

0 PALMER, JAN            A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS   TOTAL
0            ECON 103   98      35 114   1  14  75           9      35               1          10       1

The grades are everything from the A column onward, to the right.

Any ideas? Thanks,

Answer 1

Take a look at my program:

use strict;
use warnings;

# Column details and sample line, from the post
my $header  = q{0 AOZSVIN, TAMSSZ B      A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS};
my $sample  = q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     };
#               -+--------+-----+-----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---..
# chars         1212345678912345612345612341234123412341234123412341234123412341234123412341234123412341234123412341234123412341234...
# num. chars:   2 9        6     6     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   *
my $unpack  = q{A2A9       A6    A6    A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A*};
$unpack =~ s/\s//g;

# Get column names from the "$header" variable above
my @column_names = unpack($unpack, $header);
s/\s+$// for @column_names; # get rid of trailing spaces
s/^\s+// for @column_names; # get rid of leading spaces

# Some sample data in same format, to try the script out
my @samples = (
  q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     },
  q{0            AAS  353    2   3   5   2   6       1                   2                                                     },
  q{0            T304 480M   3  10   8   8   2   3   2   1                                               1               1    },
  q{0            BIOS 206    3  14   5  11   9   8   4   8   3   1   1   6                           7                      },
);

my @big_sample = (@samples) ;#x 200_000;

my @unpacked_data_as_arrayrefs;
my @unpacked_data_as_hashrefs;
my $begin = time;
for my $line ( @big_sample ) {
    my @data = unpack($unpack,$line);
    s/\s+$// for @data; # get rid of trailing spaces
    s/^\s+// for @data; # get rid of leading spaces
    push @unpacked_data_as_arrayrefs, [@data]; # stop here if this is all you need
    ## below converts the data in a hash, based on the column names given
    #my %as_hash;
    #for ( 0..$#column_names ) {
    #    $as_hash{ $column_names[$_] } = $data[$_];
    #}
    #push @unpacked_data_as_hashrefs, { %as_hash };
}
my $tot = time - $begin;
print "Done in $tot seconds\n";

# verify all data is as we expected
# uncomment the ones that test hashref, if the above hashref-building code is also uncommented.
{
    use Test::More;
    # first sample
    is($unpacked_data_as_arrayrefs[0]->[2],'AAS'); # AAS in the third column
    is($unpacked_data_as_arrayrefs[0]->[7],'35');  # 35 in the 8th column
    # fourth sample
    is($unpacked_data_as_arrayrefs[3]->[2],'BIOS');
    is($unpacked_data_as_arrayrefs[3]->[15],'6');
    # sixth: disabled, because @samples above holds only four lines
    #is($unpacked_data_as_arrayrefs[5]->[7],'114');
    #is($unpacked_data_as_arrayrefs[5]->[10],'75');
    done_testing();
}

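It uses unpack to split the text into chunks based on the width (in characters) of the fields in the string. See also perlpacktut for more details on using unpack for this kind of string munging. For a format like this, unpack is probably the best choice, because it is very fast compared with a regex (on my machine it parses 600_000 such strings in about 6 seconds).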
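One detail about the template letters, as a minimal sketch (the three cells below are made up for illustration, not taken from the file): in unpack, A strips trailing spaces from each field, while a returns the characters untouched. Neither removes leading spaces, which is why right-aligned cells still need the s/^\s+// pass.

```perl
use strict;
use warnings;

# Three hypothetical 4-character cells: " 23 ", an empty cell, and "114 ".
my $cells = ' 23 ' . (' ' x 4) . '114 ';

my @stripped = unpack 'A4 A4 A4', $cells;   # 'A' drops trailing spaces
my @raw      = unpack 'a4 a4 a4', $cells;   # 'a' keeps every character

print join('|', @stripped), "\n";   # prints " 23||114"
print join('|', @raw), "\n";        # prints " 23 |    |114 "
```

With A in the template, the trailing-space cleanup in the program above is probably redundant, though it does no harm.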

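If you need help understanding any part of the program, let me know. I didn't go into it here because the post is already on the long side (better well commented than not!). Tell me if you'd like me to.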

Answer 2

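If the format you have to accept is as loose as the one your regex currently accepts, you have a big problem: if one or more numeric fields are missing and the line contains several runs of four or more consecutive spaces, it is ambiguous which grade belongs to which column.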

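Perl's backtracking will resolve the ambiguity by picking the leftmost match and letting each greedy quantifier take as much as it can (roughly "leftmost, longest"), but (a) that is not necessarily what you want, and (b) the number of possibilities it has to try is exponential in the number of numeric fields missing from the line, hence the slowness.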

To illustrate, let's use a simpler regex:

/\A(\d+)?\s{2,}
   (\d+)?\s{2,}
   (\d+)?\s{2,}
   (\d+)?\z/xs;

Suppose the input is:

123    456    789

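(There are four spaces between each pair of numbers.) Now, should 456 be returned as the second field or the third? Both are valid matches. In this case Perl's backtracking makes it the second field, but I doubt you really want to rely on Perl's backtracking to decide that.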
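A quick way to check which reading Perl picks is to run the simplified regex and print the captures. This is only a sketch; the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $re = qr/\A(\d+)?\s{2,}
              (\d+)?\s{2,}
              (\d+)?\s{2,}
              (\d+)?\z/xs;

# Four spaces between the numbers, as in the example input.
my $input = join ' ' x 4, 123, 456, 789;

# In list context the match returns the four capture groups;
# a group that did not take part in the match comes back as undef.
my @fields = $input =~ $re;

print join(', ', map { defined $_ ? $_ : 'undef' } @fields), "\n";
# prints: 123, 456, undef, 789
```

The third group ends up empty, so 456 lands in the second field and 789 in the fourth, even though the input gives no hint about which columns were intended.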

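Suggestion: if possible, replace each \s{2,} with a regex that matches a fixed amount of whitespace. If the only reason you allow a variable amount is that the numbers line up in columns and can be 1 or 2 digits wide, then just use substr() to pull the fields from their known column offsets instead of using a regex. (Fixed-width data cannot be parsed efficiently with a regex.)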
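A minimal sketch of the substr() approach follows. The offsets are assumptions borrowed from the unpack template in Answer 1 (a 23-character prefix, then 4 characters per grade column), and the sample line is built with sprintf rather than copied from the real file.

```perl
use strict;
use warnings;

# Assumed layout: a 23-character prefix, then one 4-character cell per grade.
my $first_grade_offset = 23;
my $cell_width         = 4;
my @grade_names        = qw(A A- B+ B);   # first four columns only, for brevity

# Hypothetical line built with sprintf so the widths are explicit:
# A = 1, A- = 1, B+ empty, B = 5.
my $line = sprintf('%-13s%-5s%-5s', '0', 'AAS', '211')
         . join('', map { sprintf('%3s ', $_) } (1, 1, '', 5));

my %grade;
for my $i (0 .. $#grade_names) {
    my $cell = substr($line, $first_grade_offset + $i * $cell_width, $cell_width);
    $cell =~ s/\s+//g;                    # an empty cell becomes ''
    $grade{ $grade_names[$i] } = $cell;
}

printf "%s => [%s]\n", $_, $grade{$_} for @grade_names;
```

Because each cell is read from a fixed position, an empty B+ column stays an empty B+ column and there is nothing for the regex engine to guess. On real data, a cell that lies beyond the end of a short line may make substr return undef and warn, so checking the line length first is advisable.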

Answer 3

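If columns can be empty, then either (a) your data is ambiguous and you have bigger problems than a slow regex, or (b) your data is in a fixed-width format, like this: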

NAME   A     A-
foo    123   456
bar          789
fubb   111

If you have fixed-width data, the appropriate parsing tool is substr (or unpack), not a regex.
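As an illustration, the small table above can be read with unpack. The widths in the template (7 characters for NAME, 6 for A, the remainder for A-) are my reading of the header alignment, so treat them as an assumption.

```perl
use strict;
use warnings;

# The three data rows from the table above.
my @rows = (
    'foo    123   456',
    'bar          789',
    'fubb   111',
);

my @parsed;
for my $row (@rows) {
    # 'A' strips trailing spaces, so an empty column comes back as ''.
    my ($name, $grade_a, $grade_a_minus) = unpack 'A7 A6 A*', $row;

    # Guard against a field that a short row does not reach.
    $_ //= '' for $name, $grade_a, $grade_a_minus;

    push @parsed, sprintf '%s: A=[%s] A-=[%s]', $name, $grade_a, $grade_a_minus;
}

print "$_\n" for @parsed;
# prints:
# foo: A=[123] A-=[456]
# bar: A=[] A-=[789]
# fubb: A=[111] A-=[]
```

The empty A cell in the bar row and the missing A- cell in the fubb row both come out as empty strings in the right position, which is exactly what the variable-whitespace regex cannot guarantee.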

Answer 4

Don't use a regex. It looks like a fixed-column format, so unpack will be faster.

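Here is a sample program that handles the data shown in the question. You will still need to work out how to integrate it so that you know when a new person's record starts, and so on. I did it this way so that the unpack format comes mostly from the header: you don't have to spend much time counting columns, and it adapts easily if the column positions change: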

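use strict;
use warnings;

# Expects the header line followed by the data rows after __DATA__.
my %hash;

chomp( my $header = <DATA> );
my( $num, $name, $rest ) = unpack "a2 a20 a*", $header;
# Cut after each column name so that every chunk is one full column
# (a bare lookahead for whitespace would cut before every single space).
my @grades = split /(?<=\S)(?=\s)/, $rest;

my @grade_keys = map { /(\S+)/} @grades;

my $format = 'a13 a4 a5 ' . join ' ', map { 'a' . length } @grades;

while( <DATA> ) {
    my( $key, $label, $number, @grades ) = unpack $format, $_;

    $$_ =~ s/\s//g foreach ( \$key, \$label, \$number );

    @{ $hash{$key}{$label}{$number} }{@grade_keys} =
         map { s/\s//g; $_ } @grades;
    }

use Data::Dumper;
print Dumper( \%hash );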

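You said you ran into problems because some columns hold three-digit values. Unless such a value falls off the grid, so that its least significant digit no longer lines up with the last non-blank character of its column header, this code should work.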
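To see the header-driven idea in isolation, here is a small sketch with a made-up three-column header and one row containing a three-digit value. Both strings are hypothetical and built with sprintf; the split pattern, which cuts only where a non-space character is followed by whitespace, is my own choice for getting one chunk per column.

```perl
use strict;
use warnings;

# Hypothetical header and row; every cell is right-aligned in 4 characters.
my $header = sprintf '%4s%4s%4s', 'A', 'A-', 'B+';
my $row    = sprintf '%4s%4s%4s', 114, '', 7;

# One chunk per column, each keeping its leading padding.
my @chunks   = split /(?<=\S)(?=\s)/, $header;
my @names    = map { /(\S+)/ } @chunks;                     # column names
my $template = join ' ', map { 'a' . length($_) } @chunks;  # one width per chunk

my %grade;
@grade{@names} = map { s/\s//g; $_ } unpack $template, $row;

print "$template\n";                                        # prints "a4 a4 a4"
printf "%s => [%s]\n", $_, $grade{$_} for @names;
```

The value 114 fills its cell completely but still ends in the same character position as the A header, so it is read correctly; the empty A- cell becomes an empty string.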

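Here is the data structure I built for "AOZSVIN, TAMSSZ B" (whose sample data is now hidden in your question's edit history), although you can arrange it however you like: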

$VAR1 = {
          '0' => {
                   'BIOS' => {
                               '206' => {
                                          'F' => '6',
                                          'AU' => '',
                                          'FS' => '',
                                          'B-' => '9',
                                          'D+' => '3',
                                          'CR' => '',
                                          'B+' => '5',
                                          'WP' => '7',
                                          'C+' => '8',
                                          'NR' => '',
                                          'C' => '4',
                                          'PR' => '',
                                          'A' => '3',
                                          'W' => '',
                                          'I*' => '',
                                          'A-' => '14',
                                          'P' => '',
                                          'WF' => '',
                                          'B' => '11',
                                          'FN' => '',
                                          'D' => '1',
                                          'D-' => '1',
                                          'I' => '',
                                          'C-' => '8'
                                        }
                             },
                   'AAS' => {
                              '353' => {
                                         'F' => '2',
                                         'AU' => '',
                                         'FS' => '',
                                         'B-' => '6',
                                         'D+' => '',
                                         'CR' => '',
                                         'B+' => '5',
                                         'WP' => '',
                                         'C+' => '',
                                         'NR' => '',
                                         'C' => '1',
                                         'PR' => '',
                                         'A' => '2',
                                         'W' => '',
                                         'I*' => '',
                                         'A-' => '3',
                                         'P' => '',
                                         'WF' => '',
                                         'B' => '2',
                                         'FN' => '',
                                         'D' => '',
                                         'D-' => '',
                                         'I' => '',
                                         'C-' => ''
                                       },
                              '150' => {
                                         'F' => '4',
                                         'AU' => '',
                                         'FS' => '',
                                         'B-' => '45',
                                         'D+' => '2',
                                         'CR' => '',
                                         'B+' => '16',
                                         'WP' => '4',
                                         'C+' => '14',
                                         'NR' => '',
                                         'C' => '8',
                                         'PR' => '',
                                         'A' => '23',
                                         'W' => '',
                                         'I*' => '',
                                         'A-' => '25',
                                         'P' => '',
                                         'WF' => '',
                                         'B' => '35',
                                         'FN' => '',
                                         'D' => '1',
                                         'D-' => '1',
                                         'I' => '',
                                         'C-' => '10'
                                       }
                            },
                   'T304' => {
                               '480M' => {
                                           'F' => '',
                                           'AU' => '',
                                           'FS' => '1',
                                           'B-' => '2',
                                           'D+' => '',
                                           'CR' => '',
                                           'B+' => '8',
                                           'WP' => '',
                                           'C+' => '3',
                                           'NR' => '',
                                           'C' => '2',
                                           'PR' => '',
                                           'A' => '3',
                                           'W' => '',
                                           'I*' => '',
                                           'A-' => '10',
                                           'P' => '',
                                           'WF' => '1',
                                           'B' => '8',
                                           'FN' => '',
                                           'D' => '',
                                           'D-' => '',
                                           'I' => '',
                                           'C-' => '1'
                                         }
                             }
                 }
        };

And for the new "Palmer, Jan" sample:

$VAR1 = {
          '0' => {
                   'ECON' => {
                               '103' => {
                                          'F' => '35',
                                          'AU' => '1',
                                          'FS' => '',
                                          'B-' => '1',
                                          'D+' => '',
                                          'CR' => '',
                                          'B+' => '35',
                                          'WP' => '10',
                                          'C+' => '14',
                                          'NR' => '',
                                          'C' => '75',
                                          'PR' => '',
                                          'A' => '98',
                                          'W' => '',
                                          'I*' => '',
                                          'A-' => '',
                                          'P' => '',
                                          'WF' => '',
                                          'B' => '114',
                                          'FN' => '',
                                          'TOTAL' => '',
                                          'D' => '9',
                                          'D-' => '',
                                          'I' => '1',
                                          'C-' => ''
                                        }
                             }
                 }
        };

Answer 5

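First split the line into fixed-width chunks, spaces and all. Then clean up the chunks. Otherwise you are trying to do two things at once, which is error-prone.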

Why is my Perl regular expression so slow?