Is there a way to efficiently identify/extract headers from a table: Perl

Time: 2013-11-14 10:58:39

Tags: php xml regex perl awk

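I am trying to write a Perl script that generates XML from arbitrary tabular data supplied in a text file. For the sake of discussion, I want to take the output of the Linux command

 df -k

and parse it in my Perl script to generate the XML dynamically.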

Sample check_disk_usage.log

 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sda3             56776092   5431448  48413988  11% /
 /dev/sda1               101086     18993     76874  20% /boot
 tmpfs                  2021888         0   2021888   0% /dev/shm

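To generate the XML, I need to extract the headers from this table and store them in an array for later use (they will serve as the opening and closing tags in the XML). This is how I do it: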

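 # Read only the first line (the header row) of the log.
 open my $file, '<', "$dir/check_disk_usage.log"
     or die "Cannot open $dir/check_disk_usage.log: $!";
 my $firstLine = <$file>;
 close $file;

 # Capture every run of non-whitespace characters.
 my (@header) = $firstLine =~ /(\S+)/g;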

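That is, I match every run of one or more non-whitespace characters (effectively a word) and save the matches in an array. This works fine as long as every header name is a single word,

 e.g. Filesystem, 1K-blocks, Used, etc.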

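However, it breaks on a header name such as "Mounted on", because "Mounted" and "on" are treated as separate matches and therefore stored as separate array elements. Is there a way to efficiently identify/extract the headers from a table?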
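For illustration, a minimal sketch that reproduces the problem on the header row alone. The header text is taken from the sample log with its padding collapsed, since only the word boundaries matter here; the variable names are just for this example.

```perl
use strict;
use warnings;

# Header row from the sample log (padding collapsed to single spaces).
my $first_line = "Filesystem 1K-blocks Used Available Use% Mounted on\n";

# Same pattern as above: every run of non-whitespace characters.
my @header = $first_line =~ /(\S+)/g;

# The table has six columns, but the match yields seven words,
# because "Mounted" and "on" are captured separately.
print scalar(@header), " fields: ", join(' | ', @header), "\n";
```

The count is off by one for every header that contains a space, which is why word-based matching cannot work without knowing such headers in advance.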

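PS: I know I could use awk to replace the offending pattern and then parse the file. But then I would need to know the "offending patterns" in advance, which is not feasible because I intend this script to handle arbitrary tabular data.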

PPS: Although I am using Perl, I am open to other solutions (e.g. PHP).

Thanks for your help.

1 Answer:

Answer 0: (score: 1)

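From the look of the data, the values are separated at the character positions where every line has whitespace. If some lines have whitespace at a position and others do not, that position is not a delimiter. In the sample, the space inside "Mounted on" lines up with the last character of /dev/shm, so it does not split the header. This suggests building a mask over all lines to determine where the headers should be split.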
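To make the mask idea concrete, here is a compact sketch on a toy table. The data and variable names are hypothetical, and the mask is built with Perl's string-wise `|` rather than an array, so this is an alternative formulation of the same idea, not a copy of any particular script. It assumes the classic behaviour of `|` on two strings; under `use feature 'bitwise'` the operator would have to be written `|.=`.

```perl
use strict;
use warnings;

# Toy fixed-width table (hypothetical data), built with sprintf so the
# columns are guaranteed to line up.
my @lines = map { sprintf '%-6s %s', @$_ }
    ( [ 'Name', 'Mounted on' ], [ 'root', '/' ], [ 'shm', '/dev/shm' ] );

# Build the mask as a string of 0s and 1s: 1 means at least one line has
# a non-whitespace character in that column.
my $mask = '';
for my $line (@lines) {
    ( my $bits = $line ) =~ s/\S/1/g;    # non-whitespace -> 1
    $bits =~ tr/1/0/c;                   # everything else -> 0
    $mask |= $bits;                      # string-wise OR, column by column
}

# Each run of 1s is one column: record [start, length].
my @columns;
while ( $mask =~ /1+/g ) {
    push @columns, [ $-[0], $+[0] - $-[0] ];
}

print "$mask\n";
print join( ', ', map { sprintf '[%d, %d]', @$_ } @columns ), "\n";
print join( ' | ', map { substr $lines[0], $_->[0], $_->[1] } @columns ), "\n";
```

With this toy data the header row comes out as two fields, Name and Mounted on, because the space in "Mounted on" is covered by a non-whitespace character in the last row.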

It is a bit ugly, but here is a complete script:

#!/usr/bin/perl
# Read the file named on the command line (or STDIN), determine the column
# boundaries, and print the individual elements of each line.
use strict;
use warnings;

my @lines = map { chomp; $_ } <>;

# The mask records, per character position, whether ANY line has a
# non-whitespace character there (1) or not (0).
my @mask  = ();

foreach my $line (@lines) {
    my @chars = split //, $line;
    foreach my $index (0..$#chars) {
        $mask[$index] ||= ($chars[$index] =~ /\S/ ? 1 : 0);
    }
}

# At this point the mask indicates where to split based on the zeros within it.
# Want to turn this into substr ranges.
# So 000011110000 would become 4, 4

my @substrings = (); # will contain [from, length]
my $last_transition = 0;
my $last_value = $mask[0];

# When the mask changes from 0 to 1 or 1 to 0, $last_transition is updated.
# If the previous value was 1, a column has just ended at $index - 1, so its
# range [start, length] is recorded.
foreach my $index (1..$#mask) {
    if ($mask[$index] != $last_value) {
        if ($last_value) {
            # $index is the first whitespace position, so it is not included.
            push @substrings, [$last_transition, ($index - $last_transition)];
        }
        $last_transition = $index;
        $last_value = $mask[$index];
    }
}
# Handle the end of the line, which is considered a transition to 0
if ( $last_value ) {
    push @substrings, [$last_transition, ($#mask + 1 - $last_transition)];
}

# Just print them to show that it works, you would collect these instead.
foreach my $line (@lines) {
    foreach my $split (@substrings) {
        my $element = substr $line, $split->[0], $split->[1];
        # Trim both ends; /g is needed so leading AND trailing space go.
        $element =~ s/^\s+|\s+$//g;
        print "$line -> $element\n";
    }
}

Output:

Filesystem           1K-blocks      Used Available Use% Mounted on -> Filesystem
Filesystem           1K-blocks      Used Available Use% Mounted on -> 1K-blocks
Filesystem           1K-blocks      Used Available Use% Mounted on -> Used
Filesystem           1K-blocks      Used Available Use% Mounted on -> Available
Filesystem           1K-blocks      Used Available Use% Mounted on -> Use%
Filesystem           1K-blocks      Used Available Use% Mounted on -> Mounted on
/dev/sda3             56776092   5431448  48413988  11% / -> /dev/sda3
/dev/sda3             56776092   5431448  48413988  11% / -> 56776092
/dev/sda3             56776092   5431448  48413988  11% / -> 5431448
/dev/sda3             56776092   5431448  48413988  11% / -> 48413988
/dev/sda3             56776092   5431448  48413988  11% / -> 11%
/dev/sda3             56776092   5431448  48413988  11% / -> /
/dev/sda1               101086     18993     76874  20% /boot -> /dev/sda1
/dev/sda1               101086     18993     76874  20% /boot -> 101086
/dev/sda1               101086     18993     76874  20% /boot -> 18993
/dev/sda1               101086     18993     76874  20% /boot -> 76874
/dev/sda1               101086     18993     76874  20% /boot -> 20%
/dev/sda1               101086     18993     76874  20% /boot -> /boot
tmpfs                  2021888         0   2021888   0% /dev/shm -> tmpfs
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0%
tmpfs                  2021888         0   2021888   0% /dev/shm -> /dev/shm

Obviously, you would turn the fields of the first line into the element names rather than printing them.
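As a rough sketch of that last step, the following assumes the fields have already been collected into an array of rows with the headers first. The `<table>` and `<row>` wrapper elements and the `tag_name` and `xml_escape` helpers are hypothetical choices for this example. Note that headers such as 1K-blocks, Use% and Mounted on are not legal XML element names as they stand, so some renaming rule is needed; the one below is only one possibility.

```perl
use strict;
use warnings;

# Fields as the column-splitting step would collect them (values from the
# sample log); the first row holds the headers.
my @table = (
    [ 'Filesystem', '1K-blocks', 'Used', 'Available', 'Use%', 'Mounted on' ],
    [ '/dev/sda3', '56776092', '5431448', '48413988', '11%', '/' ],
    [ 'tmpfs',     '2021888',  '0',       '2021888',  '0%',  '/dev/shm' ],
);

# Hypothetical rule for turning a header into a legal XML element name:
# replace disallowed characters with "_" and avoid a leading digit.
sub tag_name {
    my ($header) = @_;
    ( my $tag = $header ) =~ s/[^A-Za-z0-9_.-]+/_/g;
    $tag = "_$tag" if $tag =~ /^[^A-Za-z_]/;
    return $tag;
}

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my ( $header_row, @data_rows ) = @table;
my @tags = map { tag_name($_) } @$header_row;

my $xml = "<table>\n";
for my $row (@data_rows) {
    $xml .= "  <row>\n";
    for my $i ( 0 .. $#tags ) {
        my $value = defined $row->[$i] ? $row->[$i] : '';
        $xml .= "    <$tags[$i]>" . xml_escape($value) . "</$tags[$i]>\n";
    }
    $xml .= "  </row>\n";
}
$xml .= "</table>\n";

print $xml;
```

Building the XML by string concatenation keeps the sketch dependency-free; for anything beyond simple tables a dedicated XML module would be the safer choice.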