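I have a set of HTML files that contain illegal syntax in tag attributes (the href attribute of <a> tags and others): unescaped double quotes inside double-quoted attribute values. For example,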
<a name="Conductor, "neutral""></a>
or
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
or
<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>
I am trying to process the files with Perl's XML::Twig module, using parsefile_html($file_name). When it reads a file with this syntax, it fails with this error:
x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893
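What I need is either a way to make the module accept the bad syntax and process it anyway, or a regular expression that finds the double quotes inside attribute values and replaces them with single quotes.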
Answer 0 (score: 2)
Given your HTML examples, the code below works:
use Modern::Perl;
my $html = <<end;
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
end
$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
say $html;
Output:
<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
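My concern is that variable-length lookbehind is not implemented, so if there is whitespace before or after the equals sign, the pattern will fail to match. However, the pages were most likely generated consistently, so the match should not fail in practice.
Of course, try the substitution on a copy of the files first.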
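The whitespace concern above can probably be sidestepped with \K (available since Perl 5.10), which drops everything matched so far from the replacement and so does the job of a lookbehind without the fixed-length restriction. The following is only a sketch: strip_inner_quotes is a hypothetical helper, not part of the answer, and it assumes attribute values never contain "=", "<" or ">". The stricter value class and the wider lookahead are needed because the pattern now also accepts "/>", so it can no longer rely on the difference between ">" and "/>" the way the two substitutions above do.

```perl
use strict;
use warnings;

# Hypothetical helper: delete stray double quotes from the value of one
# named attribute, tolerating spaces around the equals sign.
# Assumption: values never contain "=", "<" or ">".
sub strip_inner_quotes {
    my ($html, $attr) = @_;
    $html =~ s{
        \b \Q$attr\E \s* = \s* "    # attribute name, optional spaces, opening quote
        \K                          # keep the prefix out of the replaced text
        ( [^=<>]*? )                # the value, as short as possible
        (?= "                       # the real closing quote is followed by
            (?: \s+ [\w:-]+ \s* =   #   another attribute, or
              | \s* /? >            #   the end of the tag
            )
        )
    }{ (my $value = $1) =~ tr/"//d; $value }gex;
    return $html;
}

my $html = <<'HTML';
<meta name="keywords" content = "Conductor, "hot",Conductor, "neutral"" />
<a name ="Conductor, "neutral""></a>
HTML

$html = strip_inner_quotes($html, 'content');
$html = strip_inner_quotes($html, 'name');
print $html;
```

A value that does contain "=" (a URL with a query string, for instance) simply fails to match and is left untouched, so it is still worth checking the result on a copy of the files.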
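Answer 1 (score: 1)
The only way I could do this reasonably safely was to use two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.
The outer substitution finds all the tags in the string and replaces each one with a tag containing the adjusted attribute values.
The inner substitution finds all the attribute values in a tag and replaces each with the same value with all double quotes removed.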
use strict;
use warnings;
my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML
$html =~ s{(<[^>]+>)}{
my $tag = $1;
$tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
{
(my $attr = $1) =~ tr/"//d;
$attr;
}egx;
$tag;
}eg;
print $html;
Output:
<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
<a href="1.html" title="Page 1: What are series and parallel circuits?">
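The question asked for the inner double quotes to be replaced with single quotes rather than removed. A minimal sketch of that variant, reusing the same nested substitution and changing only the tr/// step, might look like this; requote_attributes is a hypothetical name, and the pattern carries the same assumptions as the program above (no spaces around "=", no "=", "<" or ">" inside values).

```perl
use strict;
use warnings;

# Hypothetical helper: same nested /e substitutions as above, but inner
# double quotes become single quotes instead of being deleted.
sub requote_attributes {
    my ($html) = @_;
    $html =~ s{(<[^>]+>)}{
        my $tag = $1;
        $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
                 { (my $attr = $1) =~ tr/"/'/; $attr }egx;
        $tag;
    }eg;
    return $html;
}

print requote_attributes(
    qq{<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">\n}
);
```

For the sample tag this should print the title as "Page 1: What are 'series' and 'parallel' circuits?", which keeps the quoted words visible while leaving the attribute well-formed.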