Fixing double quotes inside HTML attribute values

Time: 2012-05-16 02:09:38

Tags: html regex perl text xml-twig

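I have a set of HTML files that contain illegal syntax in the attributes of <a> and other tags: double quotes appear unescaped inside double-quoted attribute values. For example,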

<a name="Conductor, "neutral""></a>

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>

I am trying to process the files with Perl's XML::Twig module, using parsefile_html($file_name). When it reads a file with this syntax, it gives this error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893

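What I need is either a way to make the module accept the bad syntax and deal with it, or a regular expression that finds the double quotes inside attribute values and replaces them with single quotes.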

2 answers:

Answer 0 (score: 2)

Given your HTML samples, the code below works:

use Modern::Perl;

my $html = <<end;
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
end

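# Strip inner double quotes from content="..." values (tag closed by "/>").
$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
# Strip inner double quotes from name="..." values (tag closed by ">").
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;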

say $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>

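Unfortunately, variable-length lookbehind is not implemented, so the pattern will fail to match if there is any whitespace before or after the equals sign. However, the pages were most likely generated consistently, so in practice the match should not fail.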
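
If whitespace around the equals sign does turn out to be a problem, one possible workaround is to use \K (Perl 5.10 or later) in place of the lookbehind, since the text matched before \K may have any length. The following is only a sketch and has not been run against the real files; the subroutine name strip_inner_quotes and its arguments are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove double quotes found inside the value of the
# named attribute. \K drops everything matched so far from the text being
# replaced, so the prefix may contain optional whitespace of any length.
sub strip_inner_quotes {
    my ($html, $attr, $tag_end) = @_;
    $html =~ s{\b\Q$attr\E\s*=\s*"\K(.*?)(?="\s*$tag_end)}
              {(my $value = $1) =~ tr/"//d; $value}eg;
    return $html;
}

# Note the spaces around "=", which the lookbehind version cannot handle.
my $html = qq{<a name = "Conductor, "neutral""></a>\n};
$html = strip_inner_quotes($html, 'name', qr{>});
print $html;
```

As in the lookbehind version, the value is assumed to end at the last double quote directly before the end of the tag, so this only suits attributes that come last in their tag.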

Of course, try the substitution on a copy of the files first.
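
One way to do that is sketched below, assuming the files are small enough to read into memory and that a ".bak" file next to each original is an acceptable backup. The name fix_file_with_backup is made up for this illustration; File::Copy is a core module.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: keep an untouched "$file.bak" next to the original,
# then overwrite the original with whatever the $fix callback returns.
sub fix_file_with_backup {
    my ($file, $fix) = @_;
    copy($file, "$file.bak") or die "Cannot back up $file: $!";

    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $fix->($html);
    close $out or die "Cannot close $file: $!";
    return;
}

# Usage: pass the HTML file names on the command line.
fix_file_with_backup($_, sub {
    my ($html) = @_;
    # Same name="..." substitution as above, written with tr instead of s|"||g.
    $html =~ s/(?<=name=")(.*?)(?="\s*>)/(my $v = $1) =~ tr|"||d; $v/eg;
    return $html;
}) for @ARGV;
```

If the result looks wrong, the original can be restored from the ".bak" file.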

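Answer 1 (score: 1)

The only way I could find to do this reasonably safely is to use two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.

The outer substitution finds every tag in the string and replaces it with the same tag containing the adjusted attribute values.

The inner substitution finds every attribute value within the tag and replaces it with the same value with all double quotes removed.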

use strict;
use warnings;

my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

$html =~ s{(<[^>]+>)}{

  my $tag = $1;

  $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
  {
    (my $attr = $1) =~ tr/"//d;
    $attr;
  }egx;

  $tag;
}eg;

print $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
<a href="1.html" title="Page 1: What are series and parallel circuits?">
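
The question also mentions replacing the inner double quotes with single quotes rather than deleting them. A minimal variation on the same nested-substitution idea, assuming the same input shape, would swap tr/"//d for tr/"/'/. The subroutine name requote_attributes is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: same nested /e substitutions as above, but inner
# double quotes become single quotes instead of being deleted.
sub requote_attributes {
    my ($html) = @_;
    $html =~ s{(<[^>]+>)}{
        my $tag = $1;
        $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
                 { (my $attr = $1) =~ tr/"/'/; $attr }egx;
        $tag;
    }eg;
    return $html;
}

print requote_attributes(qq{<a name="Conductor, "neutral""></a>\n});
```

For the sample line this should print <a name="Conductor, 'neutral'"></a>. The same limits as the original apply: an attribute value containing "=", "<" or ">" will not be matched by the inner pattern.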