Question

I am trying to use Perl to parse a file with repeating sections like the following:

System:      server1.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user2
user3

##############################

System:      server2.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user4
user5
user6

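I can set the input record separator to "##############################", which gives me each block as a separate record.

But I need to be able to populate a hash with the usernames as keys for each server.

What is the best way to accomplish this?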

Answer 1

Try this:

use strict; use warnings;
use Data::Dumper;          # one of the top 5 modules you should know

my $hash_of_hashes = {};   # a reference to an empty hash

my $current;

while (<>) {
    chomp;
    if (/^System:\s+(.+)/) {
        $current = $1;
    }
    elsif (/^([^:]+):\s*(.+)/) {     # \s* keeps leading spaces out of the value
        $hash_of_hashes->{$current}->{$1} = $2;
    }
}

print Dumper $hash_of_hashes; # Dumper is a function from the Data::Dumper module;
# it prints the whole data structure in a human-readable form

Usage:

perl script.pl input_file.txt

Note

I assume that the System: line is always the first line of each host's block.
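The script above stores the "Name: value" header fields, but lines without a colon (the user names) are skipped. A minimal sketch of one way to collect them as well, assuming the user list always starts right after the "------" rule and ends at the "#####" delimiter or at the next System: line. The names %users_by_server and $in_output are hypothetical, and the sample input is held in a string only to keep the sketch self-contained:

```perl
use strict;
use warnings;

# Sample input held in a string so the sketch is self-contained;
# a real script would read from <> instead of this in-memory handle.
my $input = <<'END';
System:      server1.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user2
user3

##############################

System:      server2.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user4
user5
user6
END

open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";

my %users_by_server;   # hypothetical name: server => { user => 1 }
my $current;           # server whose block we are inside
my $in_output = 0;     # true once the "------" rule has been seen

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^System:\s+(\S+)/) {
        $current   = $1;
        $in_output = 0;
    }
    elsif ($line =~ /^-+$/) {
        $in_output = 1;              # user names start on the next line
    }
    elsif ($line =~ /^#+$/) {
        $in_output = 0;              # block delimiter ends the user list
    }
    elsif ($in_output && defined $current && $line =~ /\S/) {
        $users_by_server{$current}{$line} = 1;
    }
}
close $fh;

for my $server (sort keys %users_by_server) {
    print "$server: ", join(', ', sort keys %{ $users_by_server{$server} }), "\n";
}
```

Storing 1 as the value is an arbitrary choice; any placeholder would do, since only the keys matter here.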

Answer 2

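You should take a look at Perl references.

In Perl before release 5.0, you had three types of data structures, and you could only store scalar data in them. For example, I could have a hash, but each value of that hash could only be a single string or number.

Perl 5.0 introduced references. A reference is a piece of data that points to another data structure. For example, you could have a hash keyed by server, where each entry points to another hash containing the users (or, if you prefer, to a list of users).

For example, you have a hash that looks like this: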

$system{server1.domain.com}  --->  $anon_array[0] = "user1"
                                   $anon_array[1] = "user2"
                                   $anon_array[2] = "user3"

$system{server2.domain.com}  ----> $another_anon_array[0] = "user1"
                                   $another_anon_array[1] = "user2"
                                   $another_anon_array[2] = "user3"
                                   $another_anon_array[3] = "user4"

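As you can see above, each entry in the %system hash actually points to an array somewhere in memory that contains the list of users. These arrays have no names like @foo or @bar. The only way to reach them is through the %system hash. That is why they are called anonymous arrays.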
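As a minimal sketch of the picture above, square brackets build an anonymous array and return a reference to it. The user lists here are taken from the sample data in the question rather than from the diagram:

```perl
use strict;
use warnings;

my %system;

# [ ... ] builds an anonymous array and returns a reference to it.
$system{'server1.domain.com'} = [ 'user1', 'user2', 'user3' ];
$system{'server2.domain.com'} = [ 'user1', 'user4', 'user5', 'user6' ];

# The arrays have no names of their own; the hash values are the only way in.
my $first_user = $system{'server1.domain.com'}->[0];
my $user_count = scalar @{ $system{'server2.domain.com'} };

print "First user on server1: $first_user\n";
print "Users on server2: $user_count\n";
```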

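To create a reference, put a backslash in front of the variable:

$my_reference = \%my_hash;

Now $my_reference points to the hash %my_hash. If I want to turn the reference back into a hash, I put the hash sigil (%) in front of it:

%bar = %{$my_reference};

You can use the -> syntax to reach whatever a reference points to:

$foo->[0];    #Refers to the first element of the anonymous array $foo points to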

$bar = [];    #Sets $bar to be a reference to an anonymous array
$foo = {};    #Sets $foo to be a reference to an anonymous hash.
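Putting those pieces together, here is a small sketch (the variable names %copy and $list are made up for illustration) showing that the arrow works on the original data, while dereferencing with % produces a separate shallow copy:

```perl
use strict;
use warnings;

my %my_hash = (START => '20121021T01:00:56', RETURN => 0);

my $my_reference = \%my_hash;        # reference to the named hash
my %copy         = %{$my_reference}; # dereference: a shallow copy of the hash

# The arrow reaches through the reference to the original hash...
$my_reference->{RETURN} = 1;

# ...so %my_hash sees the change, while the earlier copy does not.
print "original: $my_hash{RETURN}\n";
print "copy:     $copy{RETURN}\n";

my $list = [];                       # reference to a new anonymous array
push @{$list}, 'user1', 'user2';
print "first:    $list->[0]\n";
```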

Now the real fun can begin! Instead of storing only single values, you can now store entire data structures.

Imagine something like this:

my %system;   #Normal hash keyed by domain name

$system{server1} = {};  # This points to an anonymous hash!
$system{server1}->{START}  = "20121021T01:00:56";
$system{server1}->{STOP}   = "20121021T01:00:56";
$system{server1}->{RETURN} = 0;
$system{server1}->{USERS} = [];  #This hash entry points to an anonymous array
$system{server1}->{USERS}->[0] = "user1";
$system{server1}->{USERS}->[1] = "user2";
$system{server1}->{USERS}->[2] = "user3";

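And so on for server2. You have a hash %system keyed by domain name. Each domain in the %system hash has a START time, a STOP time, a RETURN value, and the list of USERS on that system. What is the start time of server1? It is $system{server1}->{START}. What is the list of users on server2? It is @{ $system{server2}->{USERS} } (the dereference of the array stored in $system{server2}->{USERS}).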
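A short sketch of walking such a structure, assuming the same START/STOP/RETURN/USERS layout and filling in server2 with the values from the question's sample data:

```perl
use strict;
use warnings;

# The same layout, written as one nested literal.
my %system = (
    server1 => {
        START  => '20121021T01:00:56',
        STOP   => '20121021T01:00:56',
        RETURN => 0,
        USERS  => [ 'user1', 'user2', 'user3' ],
    },
    server2 => {
        START  => '20121021T01:00:56',
        STOP   => '20121021T01:00:56',
        RETURN => 0,
        USERS  => [ 'user1', 'user4', 'user5', 'user6' ],
    },
);

for my $server (sort keys %system) {
    my $info  = $system{$server};          # hash reference for this server
    my @users = @{ $info->{USERS} };       # dereference the user list
    printf "%s started %s, returned %d, users: %s\n",
        $server, $info->{START}, $info->{RETURN}, join(', ', @users);
}
```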

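This new way of thinking takes some getting used to, but you can see how it helps keep your data together in one single structure.

Of course, complex data structures come with problems of their own. For example: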

use strict;
use warnings;

my %server;
$servre{domain1} = "10.10.1.20";

will fail at compile time because I never declared %servre. But:

use strict;
use warnings;
my $hash = {};
$hash->{SERVRE}->{domain1} = "10.10.1.20";

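works just fine. In this case, SERVRE is a key in the hash that $hash refers to, not a variable, so the use strict; pragma will not catch my typo. This leads you to the next step: object-oriented Perl. But first, get to know these new complex data structures and how they work. Once you have used them in your programs, you can start looking at how object-oriented programming helps tame the mess they can create.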
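To make the problem concrete, here is a small sketch (the second address and the domain2 key are invented for illustration) in which a misspelled key silently creates a second entry next to the intended one:

```perl
use strict;
use warnings;

my $hash = {};
$hash->{SERVER}->{domain1} = "10.10.1.20";
$hash->{SERVRE}->{domain2} = "10.10.1.21";   # typo: silently creates a new key

# Both spellings now exist side by side, and nothing warned about it.
my @keys = sort keys %{$hash};
print "top-level keys: @keys\n";

# The entry we meant to add to SERVER is not there.
print "domain2 under SERVER? ",
    (exists $hash->{SERVER}{domain2} ? "yes" : "no"), "\n";
```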

Answer 3

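An interesting problem, and one that made me want to use paragraph mode. Sure, using ####... as the input record separator is an option, but it is somewhat fragile and not very flexible. For example, $/ must be a literal string, which means you need the exact number of characters.

If you can rely on the double newlines shown in the input, paragraph mode will read each "set" in two parts, with the ####... delimiter as an easily discarded third part that also signals the start of a new data set. In addition, this gives us easier access to the "users" part, which may be somewhat arbitrary and whose only defining feature is that it is preceded by the header "Output\n------".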
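To see how paragraph mode carves up this kind of input, here is a minimal sketch using a shortened, made-up version of the data held in a string (the names $input and @records are hypothetical). Setting $/ to the empty string makes each run of blank lines end a record:

```perl
use strict;
use warnings;

my $input = "System: server1\nReturn Code: 0\n\nOutput\n------\nuser1\nuser2\n\n"
          . "##########\n\n"
          . "System: server2\nReturn Code: 0\n\nOutput\n------\nuser3\n";

open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = "";          # paragraph mode: records end at blank line(s)
    @records = <$fh>;
}
close $fh;

# Expect five records: header, users, delimiter, header, users.
for my $i (0 .. $#records) {
    my ($first_line) = split /\n/, $records[$i];
    print "record $i starts with: $first_line\n";
}
```

Using local inside a block restores the previous value of $/ afterwards, which is a design choice of this sketch; the script below sets it globally instead.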

use strict;
use warnings;
use Data::Dumper;

$/ = "";                             # use paragraph mode
my @data = [];                       # first element must be array ref
while (<DATA>) {
    unless (/^#+\s*$/) {             # if not delimiter
        push @{ $data[-1] }, $_;     # save data in the arrays last element
    } else {
        push @data, [];              # start new array (which becomes the last)
    }
}
my %hash;
for (@data) {
    my ($sys, $out) = @$_;                  # $_ is an array ref w two elements
    my ($server) = $sys =~ /System:\s*(\S+)/;   # extract server name
    my @users = split /\n+/, $out;          # easy extraction of users
    splice @users, 0, 2;                    # remove header
    $hash{$server}{$_} = undef for @users;  # add key w undef value
}

print Dumper \%hash;
__DATA__
System:      server1.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user2
user3

##############################

System:      server2.domain.com
Start Time:  20121021T01:00:56
Stop Time:   20121021T01:00:56
Return Code: 0

Output
------
user1
user4
user5
user6

Output:

$VAR1 = {
          'server1.domain.com' => {
                                    'user1' => undef,
                                    'user3' => undef,
                                    'user2' => undef
                                  },
          'server2.domain.com' => {
                                    'user5' => undef,
                                    'user1' => undef,
                                    'user4' => undef,
                                    'user6' => undef
                                  }
        };

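Some notes on the details:

$data[-1] is the same as $data[$#data]; I just think it looks more readable.
Pushing an empty array reference onto @data means we start collecting a new set. This works together with the note above.
Saving the data in a two-dimensional array spares us the trouble of separating the "predictable" user names from the other data.
Splitting the output block on runs of newlines also strips any troublesome trailing newlines, so no separate chomp is needed. (In paragraph mode, chomp removes all trailing newlines from a record rather than a single one, unless we change $/ again.)
Adding undef as the value for each user name key is just a placeholder; use whatever other value you want to put there.
Changing <DATA> to <> and replacing the Dumper output will let you use the script with file name arguments or stdin. Those features are only there for demonstration. Usage would be: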

some_command | perl script.pl > output.txt
perl script.pl input.txt > output.txt
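As one possible replacement for the Dumper call, the sketch below prints one tab-separated line per server. The %hash literal is only a stand-in with the same shape as the dump shown earlier, and the output format is an assumption, not something the question specifies:

```perl
use strict;
use warnings;

# Stand-in for the %hash built by the script above (same shape as the dump).
my %hash = (
    'server1.domain.com' => { user1 => undef, user2 => undef, user3 => undef },
    'server2.domain.com' => { user1 => undef, user4 => undef, user5 => undef, user6 => undef },
);

# One line per server, sorted so the output order is stable.
my @lines;
for my $server (sort keys %hash) {
    push @lines, join("\t", $server, sort keys %{ $hash{$server} });
}
print "$_\n" for @lines;
```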

Parsing a file with repeating sections in Perl
