I am matching an HTML string to return the first HTML element and check whether it has a class attribute. My test HTML string is:
<h3 class="class-name">Blah blah</h3>
The following regex:
/^<[^>]+(?:class=['\"]([^\"]+)['\"])[^>]*>/
returns the following matches:
[0] = <h3 class="class-name">
[1] = class-name
However, as soon as I make the "class" subexpression optional:
/^<[^>]+(?:class=['\"]([^\"]+)['\"])?[^>]*>/
I lose the second match, "class-name":
[0] = <h3 class="class-name">
Can anyone tell me what I am doing wrong?
Answer 0: (score: 2)
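Interesting question. What is happening is that the first [^>]+ greedily matches everything up to the closing >. The regex then tries to match (?:class...)? at that position, which fails. But since that optional group and the following [^>]* can both match nothing at all, the regex goes on to match the closing > and declares a successful match (with nothing captured).
Interestingly, this "nothing in the capture group" behavior occurs even if the first greedy expression is made lazy: [^>]+?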
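The behavior described above is easy to reproduce. Here is a minimal sketch using Python's re module rather than the asker's regex flavor; the patterns are the ones from the question (without the string escaping of the double quotes), and the helper name first_tag is made up for illustration.

```python
import re

html = '<h3 class="class-name">Blah blah</h3>'

# The three variants discussed above; only the quantifiers differ.
patterns = {
    "required group": r"""^<[^>]+(?:class=['"]([^"]+)['"])[^>]*>""",
    "optional group, greedy": r"""^<[^>]+(?:class=['"]([^"]+)['"])?[^>]*>""",
    "optional group, lazy": r"""^<[^>]+?(?:class=['"]([^"]+)['"])?[^>]*>""",
}


def first_tag(pattern, text):
    """Return (whole match, captured class), or None if there is no match."""
    m = re.match(pattern, text)
    return (m.group(0), m.group(1)) if m else None


results = {name: first_tag(p, html) for name, p in patterns.items()}
for name, result in results.items():
    print(f"{name}: {result}")
```

Only the first variant reports class-name. Both optional variants still match the whole tag, but group 1 comes back as None.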
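But what you are really doing wrong is trying to parse HTML with a regex!
To illustrate why using a regex to parse HTML is not a good idea, consider the following characteristics of valid HTML markup: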
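An attribute value may contain angle brackets:
<b title="That's <i>entertainment!</i>">bold stuff</b>
Here the title attribute value contains text that looks like a tag but is not one.
An attribute value may be single quoted and contain double quotes as well as angle brackets:
<b title='<i class="this is not a class">is this inside an I element? NO!</i>'>bold stuff</b>
An attribute value may be unquoted:
<b class=myclass>bold stuff</b>
An attribute may have no value at all:
<option selected>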
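A small experiment makes the first two points concrete. In the sketch below (Python; the pattern <[^>]+> is a deliberately naive stand-in, not something from the question), the regex grabs what it believes is the first tag of each example:

```python
import re

# Naive idea of a tag: "anything between < and the next >".
NAIVE_TAG = re.compile(r"<[^>]+>")

samples = [
    '<b title="That\'s <i>entertainment!</i>">bold stuff</b>',
    '<b title=\'<i class="this is not a class">is this inside an I element? NO!</i>\'>bold stuff</b>',
]

first_tags = [NAIVE_TAG.search(s).group(0) for s in samples]
for tag in first_tags:
    print(tag)
```

In both cases the match stops at the first > inside the attribute value, so the "tag" is cut short in the middle of the title.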
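Many other aspects of HTML markup can break a regex (see the caveats below), but let's first consider the ones above.
Let's say that in addition to the CLASS attribute value, you also want to capture the element name and the values of the ID and TITLE attributes. All three attributes are optional. The tag name will always be captured in $1 and, if present, the CLASS, ID and TITLE attribute values will be captured in $2, $3 and $4 respectively. These attributes may appear in any order and may be intermixed with any number of other attributes, each with or without a value, and the values may be double quoted, single quoted or unquoted.
This can be (imperfectly) accomplished with a Perl / PHP / PCRE regex (see the caveats below), but it is long and complex and requires the (?|...|...) branch reset construct. Here it is: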
<?php
// Match HTML start tags and print CLASS, ID and TITLE attributes.
// Note that this method is not 100% reliable and can easily fail.
function printTagAttributes($text) {
    $re = '%# Match HTML start tags. Capture CLASS, ID and TITLE values.
        <                      # Opening < of start tag.
        (\w+)                  # $1: Element name.
        (?:                    # Group for zero or more attributes.
          \s+                  # Required whitespace before attribute.
          (?:                  # Group for attribute alternatives.
            class\s*=\s*       # Match any CLASS attribute value in $2.
            (?|                # Branch reset group for $2: value.
              "([^"]*)"        # $2.1: Double quoted value or,
            | \'([^\']*)\'     # $2.2: Single quoted value or,
            | ([\w\-.:]+)      # $2.3: Non quoted value.
            )                  # End branch reset group.
          | id\s*=\s*          # Match any ID attribute value in $3.
            (?|                # Branch reset group for $3: value.
              "([^"]*)"        # $3.1: Double quoted value or,
            | \'([^\']*)\'     # $3.2: Single quoted value or,
            | ([\w\-.:]+)      # $3.3: Non quoted value.
            )                  # End branch reset group.
          | title\s*=\s*       # Match any TITLE attribute value in $4.
            (?|                # Branch reset group for $4: value.
              "([^"]*)"        # $4.1: Double quoted value or,
            | \'([^\']*)\'     # $4.2: Single quoted value or,
            | ([\w\-.:]+)      # $4.3: Non quoted value.
            )                  # End branch reset group.
          | [\w\-.:]+          # or match any other attribute.
            (?:                # Group for optional attrib value.
              \s*=\s*          # Name and value separated by =
              (?:              # Group for attrib value alternatives.
                "[^"]*"        # Either Double quoted value,
              | \'[^\']*\'     # or single quoted value,
              | [\w\-.:]+      # or non quoted value.
              )                # End group of attrib value alts.
            )?                 # Attribute value is optional.
          )                    # End group of attribute alternatives.
        )*                     # Zero or more attributes.
        \s*                    # Optional whitespace before close >
        /?                     # Match "empty elements" too.
        >                      # Closing > of start tag.
        %ix';
    $elementcount = preg_match_all($re, $text, $matches);
    if ($elementcount) {
        printf("%d HTML start tags found:\n", $elementcount);
        for ($i = 0; $i < $elementcount; ++$i) {
            printf("Tag[%d] = \"%s\"\n", $i + 1, $matches[1][$i]);
            // Print CLASS attribute from capture group $2
            if (isset($matches[2][$i]) && $matches[2][$i]) {
                printf("\tCLASS = {%s}\n", $matches[2][$i]);
            } else {
                printf("\tTag has no CLASS attribute.\n");
            }
            // Print ID attribute from capture group $3
            if (isset($matches[3][$i]) && $matches[3][$i]) {
                printf("\tID = {%s}\n", $matches[3][$i]);
            } else {
                printf("\tTag has no ID attribute.\n");
            }
            // Print TITLE attribute from capture group $4
            if (isset($matches[4][$i]) && $matches[4][$i]) {
                printf("\tTITLE = {%s}\n", $matches[4][$i]);
            } else {
                printf("\tTag has no TITLE attribute.\n");
            }
        }
    } else {
        printf("No HTML start tags found.\n");
    }
}
$data = file_get_contents('testdata.html');
printTagAttributes($data);
?>
Here is a valid HTML 4.01 STRICT test file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Test printTagAttributes()</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1 class="H1 CLASS" id="H1_ID" title="H1 TITLE">
Test printTagAttributes()
</h1>
<h3 class="class-name">Blah blah</h3>
<p title="P1 TITLE" id="P1_ID" class="P1 CLASS" >
Paragraph 1 has attributes in reverse order.
</p>
<p class=P2_CLASS id=P2_ID title=P2_TITLE>
Paragraph 2 has attributes specified with unquoted values.
</p>
<!-- StackOverflow highlighter chokes on the following title -->
<p title='This title has <i>an embedded "non-tag"</i>!'>
Paragraph 3 has a TITLE attribute value containing
both double quotes and angle brackets. This one will
trip up many regexes!
</p>
</body>
</html>
Here is the output when the above test file is run through the script:
#use python raw string to preserve spacing...
output=r'''
10 HTML start tags found:
Tag[1] = "html"
Tag has no CLASS attribute.
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[2] = "head"
Tag has no CLASS attribute.
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[3] = "title"
Tag has no CLASS attribute.
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[4] = "meta"
Tag has no CLASS attribute.
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[5] = "body"
Tag has no CLASS attribute.
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[6] = "h1"
CLASS = {H1 CLASS}
ID = {H1_ID}
TITLE = {H1 TITLE}
Tag[7] = "h3"
CLASS = {class-name}
Tag has no ID attribute.
Tag has no TITLE attribute.
Tag[8] = "p"
CLASS = {P1 CLASS}
ID = {P1_ID}
TITLE = {P1 TITLE}
Tag[9] = "p"
CLASS = {P2_CLASS}
ID = {P2_ID}
TITLE = {P2_TITLE}
Tag[10] = "p"
Tag has no CLASS attribute.
Tag has no ID attribute.
TITLE = {This title has <i>an embedded "non-tag"</i>!}
'''
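This regex, which attempts to match start tags, does not take into account the full complexity of the (non-regular) HTML markup language. There are many ways it can be tripped up: CDATA sections, comments, scripts and styles can all cause problems. Although there are cases where using a regex on HTML is appropriate, those cases are rare.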
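For comparison, here is what the parser route can look like. This is only a sketch in Python using the standard library's html.parser module, not the PHP used above; the class name FirstElementParser and the helper first_element are made up for illustration.

```python
from html.parser import HTMLParser


class FirstElementParser(HTMLParser):
    """Record the name and attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if self.tag is None:  # ignore everything after the first element
            self.tag = tag
            self.attrs = dict(attrs)


def first_element(html):
    """Return (tag name, attribute dict) for the first element in html."""
    parser = FirstElementParser()
    parser.feed(html)
    parser.close()
    return parser.tag, parser.attrs


print(first_element('<h3 class="class-name">Blah blah</h3>'))
print(first_element("<p title='This title has <i>an embedded \"non-tag\"</i>!'>text</p>"))
print(first_element("<option selected>"))
```

Checking for a class then becomes a dictionary lookup such as attrs.get("class"), and quoting style, attribute order and angle brackets inside values are the parser's problem rather than yours.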
Answer 1: (score: 1)
ridgerunner has analyzed your problem correctly. To understand why even the lazy version fails:
<h3 class="class-name">Blah blah</h3>
^<                             # matches the opening < at the start of the string --> OK
[^>]+?                         # lazy, so it takes just one character (h) --> OK
(?:class=['\"]([^\"]+)['\"])?  # doesn't match here, but is optional --> OK
[^>]*                          # matches the rest of the tag, up to the closing > --> OK
>                              # matches the closing > --> Match!
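The official solution is of course to use an HTML parser. But in your case, you can work around the problem by extending the scope of the optional group: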
^<(?:[^>]*class=['\"]([^\"]+)['\"])?[^>]*>
Now the (?:...) group first tries all positions for a valid match, and the regex only skips the group if none of them works.
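A quick way to check the workaround is sketched below in Python's re module (the helper name first_tag_class is hypothetical, and the pattern is the one above without the string escaping of the double quotes):

```python
import re

# Workaround pattern from above: [^>]* now sits inside the optional group.
FIRST_TAG = re.compile(r"""^<(?:[^>]*class=['"]([^"]+)['"])?[^>]*>""")


def first_tag_class(html):
    """Return the class value of the leading tag, or None if it has none."""
    m = FIRST_TAG.match(html)
    return m.group(1) if m else None


print(first_tag_class('<h3 class="class-name">Blah blah</h3>'))
print(first_tag_class('<h3 id="x">Blah blah</h3>'))
print(first_tag_class('<h3 data-class="foo">Blah blah</h3>'))
```

The first call yields class-name and the second yields None. The third call yields foo, because class= is matched anywhere inside the tag, so an attribute such as data-class is picked up too; that is one more reason to treat this as a stopgap rather than a replacement for a parser.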