Question

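I'm new to Perl and am trying to build a script that parses some output files from IBM SPSS Statistics (SPSS) in order to automatically generate syntax for certain standard procedures (in this example, recoding and specifying missing values).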

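At this point I have stripped out some extraneous lines and, using a few substitution regexes (with the input record separator turned off so that I can do multi-line substitutions), cleaned up and reformatted my file. The text I am working with looks like this: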

VALUE LABELS ROAD   
0 'No'   
1 'Yes'.

VALUE LABELS NOCALL   
1 'Refused to be interviewed'   
2 'Not at home'   
3 'No one on Premises'   
8 'Other'   
9997 'Not Applicable'   
9999 'Don't Know'.

VALUE LABELS Q1   
999 'Don't know'.     

VALUE LABELS Q2   
1 'Strongly dislike'   
2 'Somewhat dislike'   
3 'Would not care'   
4 'Somewhat like'   
5 'Strongly like'   
7 'Not Applicable'   
9 'Don't know'.

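I want to add regexes to my script that go through each block between "VALUE LABELS" and the "." at the end, look for a 7 followed by "Not Applicable" or a 9 followed by "Don't Know", capture the variable name that comes right after "VALUE LABELS", and append it to the end of my output, so that I know which variables have a "Not Applicable" value and which have a "Don't Know" value. So in this example, my output would be the original file with these additional lines at the end: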

NOT APPLICABLE: NOCALL Q2  
DON'T KNOW: NOCALL Q1 Q2

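Currently, I cannot for the life of me figure out how to make my regex match only within one block, from "VALUE LABELS" to the period. Instead, it either runs from the first "VALUE LABELS" to the last "7 Not Applicable", spanning blocks, or from the first "VALUE LABELS" to the first instance of "7 Not Applicable", regardless of whether the NA value is in the same block.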

My current Perl code is as follows:

#!/bin/perl

use strict;
use warnings;

BEGIN {    # Input and Output Record Separators Off
    $\ = undef;
    $/ = undef;
}

open( my $infile, "<", $ARGV[0]);

my $outfile = "t2" . $ARGV[0];
open( my $write, ">", $outfile);

LINE: while ( <$infile> ) {

    # These are the regexes currently cleaning and reformatting the input

    s/\f/\n/g;
    s/(\d+\s.*)(\n\n)/$1\.$2/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R).*\R\R/$1/g;
    s/(\R\R)(.*\R)/$1VALUE LABELS $2/g;
}
continue {
    die "-p destination: $!\n" unless print $write "$_";
# Here is the regex I'm having an issue with
    if ( $infile =~ m/VALUE LABELS(.*)\n(?s).*\d+7 \x27Not Applicable\x27.*?\./g) {
    print $write "\n\nNOT APPLICABLE: $1";
    ]
}

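Is there a way to get this to return what I'm looking for? And is there a better way to write the whole script that would let me change the record separators partway through?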

Answer 1

On the face of it, what you are asking for is the range operator.

while (<$fh>)
{   
    if (/^\s*VALUE LABELS/ .. /\.$/) {
        # a line between the two identified above (including them)
        # process as below
    }
}

Your specification for the "period" is a bit simple, but I trust that you know your data.
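As a rough illustration of how the range (flip-flop) operator behaves on data like this, here is a minimal sketch. The sample text, the stray line, and the helper name `lines_in_blocks` are assumptions made up for the example, and the closing pattern is loosened to `/\.\s*$/` so that trailing spaces after the period do not break the range.

```perl
use strict;
use warnings;

# Hypothetical sample input, shaped like the cleaned data in the question.
my @lines = split /\n/, <<'END';
VALUE LABELS ROAD
0 'No'
1 'Yes'.

stray line outside any block
VALUE LABELS Q1
999 'Don't know'.
END

# Collect only the lines that fall inside a VALUE LABELS ... '.' range.
sub lines_in_blocks {
    my @kept;
    for (@_) {
        # The flip-flop becomes true on a line matching the first pattern
        # and stays true through the line matching the second (inclusive).
        push @kept, $_ if /^\s*VALUE LABELS/ .. /\.\s*$/;
    }
    return @kept;
}

print "$_\n" for lines_in_blocks(@lines);
```

The blank line and the stray line are skipped because the operator is false between the end of one block and the start of the next.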

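However, since your files are already "cleaned", they contain only blocks in the format shown, so you do not need to determine a range at all. The rest of the code is then quite simple.

Based on the data, I take it that the 7 or 9 is the last digit of a run of digits at the start of a line, followed by whitespace and one of those phrases. Please clarify if that is not correct.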

my (%res, $label_name);    
while (<$fh>) 
{
    next if /^\s*$/;

    if (/^\s*VALUE LABELS\s+(\S+)/) {    # \S+ keeps trailing spaces out of the name
        $label_name = $1;
        next;
    }

    if (/^\d*7\s*'(Not Applicable)'/i or /^\d*9\s*'(Don't Know)'/i)  # '
    {
        # $1 has either "Not Applicable" or "Don't Know"
        push @{$res{uc $1}}, $label_name;
    } 
}
print "$_: @{$res{$_}}\n" for keys %res;

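This prints the desired output.

We reset $label_name every time we come across a VALUE LABELS line. Empty lines are skipped.

The data ends up in the hash %res, whose keys are the two captured phrases. The value for each key is an anonymous array, to which the $label_name of the current block is added every time the phrase is detected. This is done by push-ing it onto the dereferenced array for that key, @{ $res{uc $1} }.

For references and complex data structures, see the tutorial perlreftut and the cookbook perldsc.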

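uc is used to change the phrase to upper case, as in the desired output format. This is a little wasteful, since uc runs on every match. You can omit it and post-process the resulting hash instead. That does involve copying the hash to a new one, which may or may not be more efficient. Alternatively, you can apply uc only when printing the results.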
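A minimal sketch of the "upper-case only at print time" variant follows. The contents of `%res` and the helper name `report_lines` are hypothetical. One caveat this sketch assumes matters: because the patterns use /i, the captured text keeps the case found in the file, so "Don't Know" and "Don't know" would become separate keys if uc is dropped from the push, and they have to be merged when reporting.

```perl
use strict;
use warnings;

# Hypothetical result hash, keyed by the phrases exactly as captured
# (no uc at push time, so two spellings of the same phrase coexist).
my %res = (
    'Not Applicable' => [qw(NOCALL Q2)],
    "Don't Know"     => [qw(NOCALL)],
    "Don't know"     => [qw(Q1 Q2)],
);

# Build one report line per phrase; upper-casing and merging happen only here.
sub report_lines {
    my ($res) = @_;
    my %merged;
    # Sorted keys make the merge order repeatable from run to run.
    push @{ $merged{uc $_} }, @{ $res->{$_} } for sort keys %$res;
    return map { "$_: @{ $merged{$_} }" } sort keys %merged;
}

print "$_\n" for report_lines(\%res);
```

Sorting the keys also gives a stable output order, which plain keys %res does not guarantee.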

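To append content to a file, open it in append mode, '>>'. See below.

What remains is to tie this in with the processing you show for cleaning up the data. I do not know why you need to process the file as a single string. There may well be good reasons for that, but I would not recommend it for the problem that remains once the data is cleaned. A multi-line regex in place of the simple processing above is much harder to write and more brittle.

There is one change you should make to your code, in how you use the record separators. Typically you want to localize changes to them rather than set them in a BEGIN block. Like this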

my $file_content;
CLEAN_UP_DATA: {
    local $/;  # slurp the file ($/ is now undef)
    open my $fh, '<', $file or die "Can't open $file: $!";
    $file_content = <$fh>;    
    # process file content, for example like with code in the question
};

# Here $/ is whatever it was before the block, likely the good old default

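I named the block (CLEAN_UP_DATA:), which is not necessary. The semicolon after the closing brace, };, is not required either, since a bare block is a complete statement. Note that once $/ is unset, the whole file is read into one string in a single read. (Your while (<$infile>) has exactly one iteration. You can see this by printing $. inside the loop.)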
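To see the scoping in action, here is a small sketch that slurps inside a block and then reads line by line outside it. It uses an in-memory filehandle (opening a reference to a scalar) only so that the example needs no file on disk; the `$data` content is made up.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";   # hypothetical content

# Read an in-memory "file" so the sketch needs no file on disk.
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my $slurped;
{
    local $/;            # undef only inside this block
    $slurped = <$fh>;    # a single read returns everything
}

# Outside the block $/ has its previous value again (newline by default),
# so the same handle, once rewound, is read line by line.
seek $fh, 0, 0;
my @lines = <$fh>;
close $fh;

printf "slurped %d characters, then read %d lines\n",
    length $slurped, scalar @lines;
```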

Then you can carry on. One way is to break the string with the cleaned content into lines

foreach my $line (split /\n/, $file_content) {
    # process line by line
}

and use the code from this answer (or another line-by-line approach).
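Put together, the split-into-lines approach might look like the sketch below. The `$file_content` string is an abbreviated stand-in for the cleaned file from the question, and the function name `collect_labels` is hypothetical; the matching logic is the line-by-line code from this answer.

```perl
use strict;
use warnings;

# Stand-in for the cleaned, slurped file content (abbreviated).
my $file_content = <<'END';
VALUE LABELS NOCALL
8 'Other'
9997 'Not Applicable'
9999 'Don't Know'.

VALUE LABELS Q1
999 'Don't know'.

VALUE LABELS Q2
7 'Not Applicable'
9 'Don't know'.
END

# Walk the string line by line and record which variables carry each phrase.
sub collect_labels {
    my ($text) = @_;
    my (%res, $label_name);
    foreach my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                  # skip empty lines
        if ($line =~ /^\s*VALUE LABELS\s+(\S+)/) { # start of a new block
            $label_name = $1;
            next;
        }
        if (   $line =~ /^\d*7\s*'(Not Applicable)'/i
            or $line =~ /^\d*9\s*'(Don't Know)'/i)
        {
            push @{ $res{uc $1} }, $label_name;
        }
    }
    return \%res;
}

my $res = collect_labels($file_content);
print "$_: @{ $res->{$_} }\n" for sort keys %$res;
```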

Another way is to simply write out the cleaned file and open it again.

CLEAN_UP_DATA: {
    local $/;  # slurp the file ($/ is now undef)
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $file_content = <$fh>;    
    # process file content
    open my $fh_out, '>', $outfile  or die "Can't open $outfile: $!";
    print $fh_out $file_content;    # write it out
    close $fh_out;
}; 

open my $fh, '<', $outfile  or die "Can't open $outfile: $!";
# Process line by line, obtaining %res
close $fh;

open my $fh_app, '>>', $outfile  or die "Can't open $outfile to append: $!";
# Now append results as needed, for example
print $fh_app "$_: @{$res{$_}}\n" for keys %res;

Here too you can use the code from this answer, or another line-by-line solution.

Answer 2

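If it is guaranteed that a full stop . appears only at the end of each block, then I suggest using it as the input record separator.

This program reads each block into $_ and extracts the variable name after VALUE LABELS. The block is then checked for 7 Not Applicable and 9 Don't Know, and the variable name is added to the list in %info for each phrase that is present.

The output is simply a dump of the hash. Hash order is not fixed, so the two lines may come out in either order.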

use strict;
use warnings 'all';

my ($file) = @ARGV;

my %info;

open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

local $/ = ".";    # Terminate each read at a full stop

while ( <$fh> ) {

    next unless my ($var) = /VALUE LABELS\s+(\S+)/;

    for my $pattern ( qr/7\s+'(Not Applicable)'/i, qr/9 '(Don't Know)'/i ) {
        push @{ $info{uc $1} }, $var if /$pattern/;
    }
}

while ( my ($label, $vars) = each %info ) {
    printf "%s: %s\n", $label, "@$vars";
}

Output

DON'T KNOW: NOCALL Q1 Q2
NOT APPLICABLE: NOCALL Q2

Answer 3

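I would read the whole input file into a single variable and then try to match something like /(VALUE LABELS(.*?)\.\n)/gs. The /s modifier lets . match newlines, so the match can span the lines of a block, and the non-greedy .*? stops at the first period that is followed by a newline.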

Then, within the result of that match, use a second regex to look for the "Not Applicable" string. Repeat until all input is consumed.
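A sketch of this two-stage idea, under some assumptions: the sample text is an abbreviated copy of the question's data, the helper name `vars_with_label` is hypothetical, and the block pattern is a slightly more tolerant variant that allows trailing spaces after the period and a block that ends at end of input.

```perl
use strict;
use warnings;

# Abbreviated stand-in for the slurped, cleaned file.
my $text = <<'END';
VALUE LABELS NOCALL
8 'Other'
9997 'Not Applicable'
9999 'Don't Know'.

VALUE LABELS Q1
999 'Don't know'.

VALUE LABELS Q2
7 'Not Applicable'
9 'Don't know'.
END

# Return the variable names of all blocks whose body matches $label_re.
sub vars_with_label {
    my ($text, $label_re) = @_;
    my @vars;
    # Stage 1: /s lets . cross newlines; .*? stops at the first period
    # that ends a line, so each match covers exactly one block.
    while ($text =~ /VALUE LABELS\s+(\S+)(.*?)\.[ \t]*(?:\n|\z)/gs) {
        my ($var, $body) = ($1, $2);
        # Stage 2: look for the phrase inside this block only.
        push @vars, $var if $body =~ $label_re;
    }
    return @vars;
}

my @na = vars_with_label($text, qr/^\d*7\s*'Not Applicable'/mi);
my @dk = vars_with_label($text, qr/^\d*9\s*'Don't Know'/mi);

print "NOT APPLICABLE: @na\n";
print "DON'T KNOW: @dk\n";
```

Because the second regex only ever sees one block's body, a match can no longer run from one VALUE LABELS into a later block.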
