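Regular expression in a string

Time: 2015-06-16 11:48:42

Tags: regex perl

I have some information in an XML file. I want to use a Perl script to find the style attribute values in that file.

XML file content: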

<ul type="disc">
    <li class="MsoNormal" style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto">
        <span style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'"><font size="3"><font face="Calibri">Highlight the items you want to recover.</font></font></span></li>
</ul>

Perl script snippet:

while ($line =~ /style="([a-zA-Z0-9]+)"/gis) {
                if ($articlenbfound == 1) {
                    $articlehits++;
                    my $thelink = $1;
                    disp_str(linktofile($dir . $name . $ext) . "   line " . $index . ": <font color=red>Article " . $articlenb . " match</font>: " . $thelink . "\n");
                }
            }

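In this script I am trying to capture the style attribute value, and I need to print all of the style attribute values.

1 answer:

Answer 0 (score: 3)

That is XML. Parsing XML with regular expressions is a bad idea, because a regex matches the text layout rather than the document structure. For example, these XML fragments are all semantically identical: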

<ul type="disc">
  <li
      class="MsoNormal"
      style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto">
    <span style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'">
      <font size="3">
        <font face="Calibri">Highlight the items you want to recover.</font>
      </font>
    </span>
  </li>
</ul>

<ul
type="disc"
><li
class="MsoNormal"
style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"
><span
style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'"
><font
size="3"
><font
face="Calibri"
>Highlight the items you want to recover.</font></font></span></li></ul>

<ul type="disc"><li class="MsoNormal" style="line-height: normal; margin: 0in 0in 10pt; color: black; mso-list: l1 level1 lfo2; tab-stops: list .5in; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto"><span style="mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'"><font size="3"><font face="Calibri">Highlight the items you want to recover.</font></font></span></li></ul>

So please, use a parser. Since you have tagged perl, I will include a Perl solution:

use strict;
use warnings;
use XML::Twig;

XML::Twig->new(
    twig_handlers => {
        'span' => sub { print $_ ->att('style'), "\n" }
    }
)->parsefile ( 'your_file.xml' );

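This prints the style attribute of each span element on its own line. Once it is extracted, you can turn it into key-value pairs by splitting on ; and using : as the key-value separator.

E.g.: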

use Data::Dumper;

# Inside the twig handler, where $_ is the current element
my $style  = $_->att('style');
my %styles = map { split( ': ', $_, 2 ) } split( '; ', $style );
print Dumper \%styles;
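
The split above assumes every declaration is separated by exactly "; " and every key by exactly ": ". A slightly more tolerant version, as a minimal sketch using only core Perl, might look like the following. The name parse_style is hypothetical and chosen only for illustration; the sample string is the span style from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a CSS-style declaration string into a hash.
sub parse_style {
    my ($style) = @_;
    my %styles;
    for my $decl ( split /\s*;\s*/, $style ) {
        # Split on the first colon only, so values may contain colons.
        my ( $key, $value ) = split /\s*:\s*/, $decl, 2;
        next unless defined $value;    # skip empty or malformed declarations
        $key   =~ s/^\s+//;            # trim leading whitespace from the key
        $value =~ s/\s+$//;            # trim trailing whitespace from the value
        $styles{$key} = $value;
    }
    return %styles;
}

my %styles = parse_style(
    q{mso-fareast-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'}
);
print "$_ => $styles{$_}\n" for sort keys %styles;
```

Skipping declarations without a colon keeps the hash well formed even when the attribute ends in a stray semicolon.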

But exactly what you do with it depends on what you are trying to accomplish.
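
If the goal is to print every style attribute in the file, as the question asks, rather than only those on span elements, one possible sketch uses XML::Twig's _all_ handler and keeps the elements that actually carry the attribute. The inline sample is a shortened version of the XML from the question, and the @found array is an illustrative name; a real script would presumably call parsefile on the actual file instead.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample input based on the question, shortened for illustration.
my $xml = <<'XML';
<ul type="disc">
  <li class="MsoNormal" style="line-height: normal; color: black">
    <span style="mso-fareast-font-family: 'Times New Roman'"><font size="3">text</font></span>
  </li>
</ul>
XML

my @found;    # one "tag: style" entry per element that has a style attribute

XML::Twig->new(
    twig_handlers => {
        # _all_ fires for every element once it has been fully parsed
        _all_ => sub {
            my ( $twig, $elt ) = @_;
            my $style = $elt->att('style');    # undef when the attribute is absent
            push @found, $elt->tag . ': ' . $style if defined $style;
        },
    },
)->parse($xml);

print "$_\n" for @found;
```

Handlers run when an element's closing tag is reached, so nested elements are reported before their parents.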