Question

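I am trying to write a Perl script that generates XML from arbitrary tabular data supplied in a text file. For the sake of discussion, I want to take the output of the Linux command

 df -k

parse it in my Perl script, and generate XML dynamically.

Sample check_disk_usage.log: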

 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sda3             56776092   5431448  48413988  11% /
 /dev/sda1               101086     18993     76874  20% /boot
 tmpfs                  2021888         0   2021888   0% /dev/shm

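Now, to generate the XML, I need to extract the headers from this table and store them in an array for later use (they will serve as the opening and closing tags in the XML). This is how I do it: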

 open my $file, '<', "$dir/check_disk_usage.log"
     or die "Cannot open $dir/check_disk_usage.log: $!";
 my $firstLine = <$file>;
 close $file;

 my (@header) = $firstLine =~ /(\S+)/g;

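That is, I match every run of one or more non-whitespace characters (effectively a word) and store the matches in an array. This works fine as long as each header name is a single word,

 e.g. Filesystem, 1K-blocks, Used, etc.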

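However, it breaks as soon as it meets a header name such as "Mounted on", because "Mounted" and "on" are treated as separate matches and are therefore stored as separate array elements. Is there a way to efficiently identify/extract the headers of a table?

PS: I know I could use awk to replace the problematic pattern and then parse the file. But then I would need to know the "offending pattern" in advance, which is not feasible because I intend to write this script for arbitrary tabular data.

PPS: Although I am using Perl, I am open to solutions in other languages as well (e.g. PHP).

Thanks for your help.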

Answer 1

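By the look of the data, the values are separated at the character positions where every line has whitespace. If a position holds whitespace in some lines but not in others, it is not a delimiter. This suggests building a mask to determine where to split the headers.

A bit ugly, but: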

#!/usr/bin/perl
# Read the input (files named on the command line, or STDIN), determine the
# column boundaries, and print the individual elements of each line.
use strict;
use warnings;

chomp(my @lines = <>);

# The mask records whether a character position has ever held a
# NON-whitespace character in any line (1) or not (0).
my @mask = ();

foreach my $line (@lines) {
    my @chars = split //, $line;
    foreach my $index (0..$#chars) {
        $mask[$index] ||= ($chars[$index] =~ /\S/) ? 1 : 0;
    }
}

# At this point the mask indicates where to split based on the zeros within it.
# Want to turn this into substr ranges.
# So 000011110000 would become 4, 4

my @substrings = (); # will contain [from, length]
my $last_transition = 0;
my $last_value = $mask[0];

# When it transitions from 0 to 1 or 1 to 0 the $last_transition is updated.
# When the last value was a 1 it means a column has just ended and needs
# to be recorded as a range.
foreach my $index (1..$#mask) {
    if ($mask[$index] != $last_value) {
        if ($last_value) {
            # The column covers $last_transition .. $index - 1
            push @substrings, [$last_transition, $index - $last_transition];
        }
        $last_transition = $index;
        $last_value = $mask[$index];
    }
}
# Handle the end of the line, which is considered a transition to 0
if ($last_value) {
    push @substrings, [$last_transition, @mask - $last_transition];
}

# Just print them to show that it works, you would collect these instead.
foreach my $line (@lines) {
    foreach my $split (@substrings) {
        # Short lines may end before a column starts; treat that as empty.
        my $element = $split->[0] < length($line)
            ? substr($line, $split->[0], $split->[1])
            : '';
        # /g is needed so that BOTH leading and trailing whitespace go away
        $element =~ s/^\s+|\s+$//g;
        print "$line -> $element\n";
    }
}

Output:

Filesystem           1K-blocks      Used Available Use% Mounted on -> Filesystem
Filesystem           1K-blocks      Used Available Use% Mounted on -> 1K-blocks
Filesystem           1K-blocks      Used Available Use% Mounted on -> Used 
Filesystem           1K-blocks      Used Available Use% Mounted on -> Available
Filesystem           1K-blocks      Used Available Use% Mounted on -> Use%
Filesystem           1K-blocks      Used Available Use% Mounted on -> Mounted on
/dev/sda3             56776092   5431448  48413988  11% / -> /dev/sda3
/dev/sda3             56776092   5431448  48413988  11% / -> 56776092 
/dev/sda3             56776092   5431448  48413988  11% / -> 5431448
/dev/sda3             56776092   5431448  48413988  11% / -> 48413988 
/dev/sda3             56776092   5431448  48413988  11% / -> 11% 
/dev/sda3             56776092   5431448  48413988  11% / -> /
/dev/sda1               101086     18993     76874  20% /boot -> /dev/sda1
/dev/sda1               101086     18993     76874  20% /boot -> 101086 
/dev/sda1               101086     18993     76874  20% /boot -> 18993 
/dev/sda1               101086     18993     76874  20% /boot -> 76874 
/dev/sda1               101086     18993     76874  20% /boot -> 20% 
/dev/sda1               101086     18993     76874  20% /boot -> /boot
tmpfs                  2021888         0   2021888   0% /dev/shm -> tmpfs
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0% 
tmpfs                  2021888         0   2021888   0% /dev/shm -> /dev/shm

Obviously, you would treat the first line as the element names rather than printing it.
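To illustrate that last step, here is a rough sketch of how the header fields could become XML tags. It is only one possible approach, not a definitive implementation. The helper names (`column_ranges`, `split_fields`, `tag_name`, `xml_escape`, `table_to_xml`), the `<table>`/`<row>` wrapper elements, and the rule for turning a header such as "Mounted on" into a tag name are all assumptions made for this example. The sample table is hard-coded so the sketch runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compute [from, length] column ranges using the same
# whitespace-mask idea as the script above.
sub column_ranges {
    my @lines = @_;
    my @mask;
    for my $line (@lines) {
        my @chars = split //, $line;
        $mask[$_] ||= ($chars[$_] =~ /\S/) ? 1 : 0 for 0 .. $#chars;
    }
    my @ranges;
    my $start;    # start of the current run of 1s, undef while in a gap
    for my $i (0 .. $#mask) {
        if ($mask[$i]) {
            $start = $i unless defined $start;
        }
        elsif (defined $start) {
            push @ranges, [$start, $i - $start];
            undef $start;
        }
    }
    push @ranges, [$start, @mask - $start] if defined $start;
    return @ranges;
}

# Hypothetical helper: cut one line into trimmed fields using those ranges.
sub split_fields {
    my ($line, @ranges) = @_;
    return map {
        my ($from, $len) = @$_;
        my $field = $from < length($line) ? substr($line, $from, $len) : '';
        $field =~ s/^\s+|\s+$//g;
        $field;
    } @ranges;
}

# Assumption: a header becomes an element name by replacing every character
# outside [A-Za-z0-9_.-] with "_" and prefixing "_" when the result would
# not start with a letter or underscore.
sub tag_name {
    my ($header) = @_;
    (my $name = $header) =~ s/[^A-Za-z0-9_.-]/_/g;
    $name = "_$name" unless $name =~ /^[A-Za-z_]/;
    return $name;
}

# Escape the characters that are special in XML element content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# First line supplies the tag names, every other line becomes one <row>.
sub table_to_xml {
    my @lines  = @_;
    my @ranges = column_ranges(@lines);
    my ($header, @rows) = @lines;
    my @tags = map { tag_name($_) } split_fields($header, @ranges);
    my $xml  = "<table>\n";
    for my $row (@rows) {
        my @fields = split_fields($row, @ranges);
        $xml .= "  <row>\n";
        $xml .= "    <$tags[$_]>" . xml_escape($fields[$_]) . "</$tags[$_]>\n"
            for 0 .. $#tags;
        $xml .= "  </row>\n";
    }
    return $xml . "</table>\n";
}

my @sample = (
    'Filesystem           1K-blocks      Used Available Use% Mounted on',
    '/dev/sda3             56776092   5431448  48413988  11% /',
    '/dev/sda1               101086     18993     76874  20% /boot',
    'tmpfs                  2021888         0   2021888   0% /dev/shm',
);
print table_to_xml(@sample);
```

Headers such as "1K-blocks", "Use%" and "Mounted on" are not valid XML element names as they stand, which is why the sketch rewrites them (here to `_1K-blocks`, `Use_` and `Mounted_on`). If the original header text matters, it may be better to use a fixed element name and carry the header in an attribute instead.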
