Question

I am matching an HTML string to return the first HTML element and check whether it has a class attribute. My test HTML string is:

<h3 class="class-name">Blah blah</h3>

The following regular expression:

/^<[^>]+(?:class=['\"]([^\"]+)['\"])[^>]*>/

returns the following matches:

[0] = <h3 class="class-name">
[1] = class-name

However, as soon as I make the "class" subexpression optional:

/^<[^>]+(?:class=['\"]([^\"]+)['\"])?[^>]*>/

I lose the second match, "class-name":

[0] = <h3 class="class-name">

Can anyone tell me what I am doing wrong?

Answer 1

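Interesting question. What happens is that the first [^>]+ greedily matches everything up to the closing >. The engine then tries to match (?:class...)? at that position, which fails. However, since this optional group and the following [^>]* can both match the empty string, the closing > is matched and a successful match is declared (with nothing captured).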

Interestingly, this "nothing in the capture group" behavior occurs even if the first greedy expression is made lazy: [^>]+?.
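The behavior described above can be reproduced with a short script. This is a minimal sketch that assumes Python's re module (the question does not say which regex engine is in use) and drops the / delimiters; the names html, REQUIRED, OPTIONAL and LAZY are made up for the illustration.

```python
import re

html = '<h3 class="class-name">Blah blah</h3>'

# The three variants discussed above, written as Python raw strings.
REQUIRED = re.compile(r"""^<[^>]+(?:class=['"]([^"]+)['"])[^>]*>""")
OPTIONAL = re.compile(r"""^<[^>]+(?:class=['"]([^"]+)['"])?[^>]*>""")
LAZY = re.compile(r"""^<[^>]+?(?:class=['"]([^"]+)['"])?[^>]*>""")

for name, pattern in [("required", REQUIRED), ("optional", OPTIONAL), ("lazy", LAZY)]:
    m = pattern.match(html)
    # group(0) is the whole tag; group(1) is None when the optional
    # group matched the empty string instead of the class attribute.
    print(name, repr(m.group(0)), repr(m.group(1)))
```

All three patterns match the whole start tag, but only the one with the required group is forced to backtrack far enough for class= to line up, so only it fills group 1.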

But what you are really doing wrong is trying to parse HTML with a regex!

Additional

To illustrate why using regular expressions to parse HTML is not a good idea, consider the following characteristics of valid HTML markup:

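Attribute values may be double quoted, and may contain single quotes and angle brackets, e.g. <b title="That's <i>entertainment!</i>">bold stuff</b>. Here the title attribute value contains text that looks like tags but is not.
Attribute values may be single quoted, and may contain double quotes and angle brackets, e.g. <b title='<i class="this is not a class">is this inside an I element? NO!</i>'>bold stuff</b>.
Attribute values may be unquoted, e.g. <b class=myclass>bold stuff</b>
Attribute values are optional, e.g. <option selected>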

Many other aspects of HTML markup can trip up a regex (see the caveats below), but let's consider just the aspects above first.
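For comparison, here is how a real parser copes with the four cases listed above. This is a minimal sketch using Python's standard html.parser module, which is an assumption on my part since the thread itself only uses PHP; StartTagCollector and start_tags are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None
        # when the attribute has no value at all.
        self.tags.append((tag, dict(attrs)))


def start_tags(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


CASES = [
    '<b title="That\'s <i>entertainment!</i>">bold stuff</b>',
    "<b title='<i class=\"this is not a class\">is this inside an I element? NO!</i>'>bold stuff</b>",
    "<b class=myclass>bold stuff</b>",
    "<option selected>",
]

for case in CASES:
    print(start_tags(case))
```

Each case should yield exactly one start tag: the tag-like text inside the quoted title values stays part of the attribute value rather than being reported as an element.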

A similar (but slightly more complex) problem:

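Let's say that in addition to the CLASS attribute value, you also want to capture the element name and the values of the ID and TITLE attributes. All three attributes are optional. The tag name is always captured in $1, and the CLASS, ID and TITLE attribute values, if present, are captured in $2, $3 and $4 respectively. These attributes may appear in any order and may be intermixed with any number of other attributes, each with or without a value, and the values may be double quoted, single quoted or unquoted.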

This can be done (imperfectly, see the caveats below) with a Perl/PHP/PCRE regex, but the regex is long and complex and requires the (?|...|...) branch reset construct. Here it is:

<?php
// Match HTML start tags and print CLASS, ID and TITLE attributes.
// Note that this method is not 100% reliable and can easily fail.
function printTagAttributes($text) {
    $re = '%# Match HTML start tags. Capture CLASS, ID and TITLE values.
        <                     # Opening < of start tag.
        (\w+)                 # $1: Element name.
        (?:                   # Group for zero or more attributes.
          \s+                 # Required whitespace before attribute.
          (?:                 # Group for attribute alternatives.
            class\s*=\s*      # Match any CLASS attribute value in $2.
            (?|               # Branch reset group for $2: value.
              "([^"]*)"       # $2.1: Double quoted value or,
            | \'([^\']*)\'    # $2.2: Single quoted value or,
            | ([\w\-.:]+)     # $2.3: Non quoted value.
            )                 # End branch reset group.
          | id\s*=\s*         # Match any ID attribute value in $3.
            (?|               # Branch reset group for $3: value.
              "([^"]*)"       # $3.1: Double quoted value or,
            | \'([^\']*)\'    # $3.2: Single quoted value or,
            | ([\w\-.:]+)     # $3.3: Non quoted value.
            )                 # End branch reset group.
          | title\s*=\s*      # Match any TITLE attribute value in $4.
            (?|               # Branch reset group for $4: value.
              "([^"]*)"       # $4.1: Double quoted value or,
            | \'([^\']*)\'    # $4.2: Single quoted value or,
            | ([\w\-.:]+)     # $4.3: Non quoted value.
            )                 # End branch reset group.
          | [\w\-.:]+         # or match any other attribute.
            (?:               # Group for optional attrib value.
              \s*=\s*         # Name and value separated by =
              (?:             # Group for attrib value alternatives.
                "[^"]*"       # Either Double quoted value,
              | \'[^\']*\'    # or single quoted value,
              | [\w\-.:]+     # or non quoted value.
              )               # End group of attrib value alts.
            )?                # Attribute value is optional.
          )                   # End group of attribute alternatives.
        )*                    # Zero or more attributes.
        \s*                   # Optional whitespace before close >
        /?                    # Match "empty elements" too.
        >                     # Closing > of start tag.
        %ix';
    $elementcount = preg_match_all($re, $text, $matches);
    if ($elementcount) {
        printf("%d HTML start tags found:\n", $elementcount);
        for ($i = 0; $i < $elementcount; ++$i) {
            printf("Tag[%d] = \"%s\"\n", $i + 1, $matches[1][$i]);
            // Print CLASS attribute from capture group $2
            if (isset($matches[2][$i]) && $matches[2][$i]) {
                printf("\tCLASS = {%s}\n", $matches[2][$i]);
            } else {
                printf("\tTag has no CLASS attribute.\n");
            }
            // Print ID attribute from capture group $3
            if (isset($matches[3][$i]) && $matches[3][$i]) {
                printf("\tID = {%s}\n", $matches[3][$i]);
            } else {
                printf("\tTag has no ID attribute.\n");
            }
            // Print TITLE attribute from capture group $4
            if (isset($matches[4][$i]) && $matches[4][$i]) {
                printf("\tTITLE = {%s}\n", $matches[4][$i]);
            } else {
                printf("\tTag has no TITLE attribute.\n");
            }
        }
    } else {
        printf("No HTML start tags found.\n");
    }
}
$data = file_get_contents('testdata.html');
printTagAttributes($data);
?>
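As far as I know, Python's re module has no (?|...) branch reset construct, so the same idea has to be split into two passes there: one regex finds each start tag, and a second walks over that tag's attributes. The sketch below is an illustration of that swapped-in technique, not a port of the PHP code above; START_TAG, ATTRIBUTE and tag_attributes are hypothetical names, and it shares all of the regex approach's weaknesses.

```python
import re

# A start tag: name, then zero or more attributes (same grammar as above).
START_TAG = re.compile(
    r"""<(\w+)((?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*)\s*/?>"""
)
# One attribute: name, then an optional double quoted, single quoted
# or unquoted value (captured in group 2, 3 or 4 respectively).
ATTRIBUTE = re.compile(
    r"""([\w\-.:]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([\w\-.:]+)))?"""
)


def tag_attributes(text, wanted=("class", "id", "title")):
    """Return (tag name, {attribute: value or None}) for each start tag."""
    results = []
    for tag in START_TAG.finditer(text):
        found = dict.fromkeys(wanted)  # every wanted attribute starts as None
        for attr in ATTRIBUTE.finditer(tag.group(2)):
            name = attr.group(1).lower()
            if name in found:
                # Take whichever of the three value groups took part in the match.
                values = [g for g in attr.groups()[1:] if g is not None]
                found[name] = values[0] if values else None
        results.append((tag.group(1), found))
    return results


print(tag_attributes('<h3 class="class-name">Blah blah</h3>'))
print(tag_attributes("<p class=P2_CLASS id=P2_ID title=P2_TITLE>text</p>"))
```

Doing the attribute lookup in ordinary code replaces the three branch reset groups with one small loop, at the cost of scanning each tag twice.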

Here is a valid HTML 4.01 STRICT test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <title>Test printTagAttributes()</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
    <h1 class="H1 CLASS" id="H1_ID" title="H1 TITLE">
        Test printTagAttributes()
    </h1>
    <h3 class="class-name">Blah blah</h3>
    <p title="P1 TITLE" id="P1_ID" class="P1 CLASS" >
        Paragraph 1 has attributes in reverses order.
    </p>
    <p class=P2_CLASS id=P2_ID title=P2_TITLE>
        Paragraph 2 has attributes specified with unquoted values.
    </p>
    <p title='This title has <i>an embedded "non-tag"</i>!'>
        Paragraph 3 has a TITLE attribute value containing both double quotes and angle brackets. This one will trip up many regexes!
    </p>
</body>
</html>

Output from running the above test file through the script:

10 HTML start tags found:
Tag[1] = "html"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[2] = "head"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[3] = "title"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[4] = "meta"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[5] = "body"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[6] = "h1"
    CLASS = {H1 CLASS}
    ID = {H1_ID}
    TITLE = {H1 TITLE}
Tag[7] = "h3"
    CLASS = {class-name}
    Tag has no ID attribute.
    Tag has no TITLE attribute.
Tag[8] = "p"
    CLASS = {P1 CLASS}
    ID = {P1_ID}
    TITLE = {P1 TITLE}
Tag[9] = "p"
    CLASS = {P2_CLASS}
    ID = {P2_ID}
    TITLE = {P2_TITLE}
Tag[10] = "p"
    Tag has no CLASS attribute.
    Tag has no ID attribute.
    TITLE = {This title has <i>an embedded "non-tag"</i>!}

Caveats:

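This regex, which attempts to match start tags, does not take into account the full complexity of the (non-regular) HTML markup language. There are many ways to trip it up: CDATA sections, comments, scripts and styles, for example, can all cause problems. There are cases where using a regex on HTML is appropriate, but they are rare.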
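One of these failure modes, comments, is easy to demonstrate. The sketch below compares a deliberately simple start-tag regex (not the long PCRE pattern above) with Python's standard html.parser on markup that contains a commented-out element; SAMPLE, TagNames, regex_tag_names and parser_tag_names are hypothetical names chosen for the example.

```python
import re
from html.parser import HTMLParser

SAMPLE = '<!-- <p class="old">retired</p> --><p class="new">live</p>'


class TagNames(HTMLParser):
    """Collect the names of real start tags only."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def regex_tag_names(markup):
    # Naive start-tag regex: it has no notion of comments.
    return re.findall(r"<(\w+)[^>]*>", markup)


def parser_tag_names(markup):
    parser = TagNames()
    parser.feed(markup)
    parser.close()
    return parser.names


print(regex_tag_names(SAMPLE))   # also reports the commented-out <p>
print(parser_tag_names(SAMPLE))  # reports only the live <p>
```

The regex counts the tag inside the comment as a real element, while the parser treats the whole comment as a single token and skips it.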

Answer 2

ridgerunner has analyzed your problem correctly. To see why even the lazy version fails:

 <h3 class="class-name">Blah blah</h3>
^<                               # matches until the first <           --> OK
  [^>]+?                         # lazily matches just one char        --> OK
  (?:class=['\"]([^\"]+)['\"])?  # doesn't match here, but is optional --> OK
  [^>]*                          # matches until the end of the tag    --> OK
                       >         # matches the closing >               --> Match!

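The official solution is, of course, to use an HTML parser. But in your case, you can solve the problem by extending the scope of the optional group: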

^<(?:[^>]*class=['\"]([^\"]+)['\"])?[^>]*>

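Now the (?:...) group first tries every position for a valid match: the [^>]* inside the group starts out greedy and backtracks one character at a time until class= lines up, and the group is skipped only if no position works.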
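A quick check of the widened group, again as a minimal sketch that assumes Python's re module; FIXED and first_tag_class are hypothetical names for the illustration.

```python
import re

# The optional group now contains its own [^>]*, so the engine backtracks
# inside the group looking for class= before giving up on it.
FIXED = re.compile(r"""^<(?:[^>]*class=['"]([^"]+)['"])?[^>]*>""")


def first_tag_class(html):
    """Return the class of the first tag, or None if it has none."""
    m = FIXED.match(html)
    return m.group(1) if m else None


print(first_tag_class('<h3 class="class-name">Blah blah</h3>'))
print(first_tag_class('<h3 id="x" class="a b">Blah blah</h3>'))
print(first_tag_class("<h3>Blah blah</h3>"))
```

The pattern keeps the question's [^"]+ for the value, so a single-quoted value can still overrun if a quote character appears later in the string, and any attribute whose name merely ends in class would also match. Those limits are part of why a parser remains the safer choice.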

Regex does not match a subexpression when it is optional
