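Regular expression does not match a subexpression once it is made optional

Asked: 2011-12-12 16:59:08

Tags: php regex

I am matching against an HTML string to return the first HTML element and check whether it has a class attribute. My test HTML string is: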

<h3 class="class-name">Blah blah</h3>

The following regular expression:

/^<[^>]+(?:class=['\"]([^\"]+)['\"])[^>]*>/

returns the following matches:

[0] = <h3 class="class-name">
[1] = class-name

However, as soon as I make the "class" subexpression optional:

/^<[^>]+(?:class=['\"]([^\"]+)['\"])?[^>]*>/

I lose the second match, "class-name":

[0] = <h3 class="class-name">

Can anyone tell me what I am doing wrong?

2 answers:

Answer 0 (score: 2)

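Interesting question. What happens is that the first [^>]+ greedily matches everything up to (but not including) the closing >. The regex then tries to match (?:class...)?, which fails at that position. However, because both this optional group and the [^>]* that follows can match the empty string, the engine goes on to match the closing > and declares a successful match (with nothing captured).

Interestingly, this "nothing in the capture group" behavior occurs even when the first greedy expression is made lazy: [^>]+?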
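The behavior described above can be reproduced with a short script. This is only a sketch: the array keys and variable names are made up for illustration, and the three patterns are the question's two regexes plus the lazy variant, written as single-quoted PHP strings.

```php
<?php
$html = '<h3 class="class-name">Blah blah</h3>';

// Labels are illustrative only; the patterns come from the question.
$patterns = [
    'required group' => '/^<[^>]+(?:class=[\'"]([^"]+)[\'"])[^>]*>/',
    'optional group' => '/^<[^>]+(?:class=[\'"]([^"]+)[\'"])?[^>]*>/',
    'optional, lazy' => '/^<[^>]+?(?:class=[\'"]([^"]+)[\'"])?[^>]*>/',
];

$results = [];
foreach ($patterns as $label => $re) {
    preg_match($re, $html, $m);
    $results[$label] = $m;
    // PHP omits trailing groups that did not participate in the match,
    // so $m[1] is simply absent for the two optional variants.
    printf("%s: full match = %s, group 1 = %s\n", $label, $m[0], $m[1] ?? '(not set)');
}
```

All three patterns match the whole start tag, but only the first one, where the group is mandatory and forces backtracking, fills capture group 1.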

But what you are really doing wrong is trying to parse HTML with a regex!
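For comparison, here is a minimal sketch of the parser-based route using PHP's DOM extension. The helper name firstElementClass is hypothetical, and the sketch assumes the DOM extension is installed and that the input is a non-empty fragment whose first element belongs in the document body.

```php
<?php
// Hypothetical helper: returns the class attribute of the first element
// in an HTML fragment, or null if there is no element or no class attribute.
function firstElementClass(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    // The parser wraps a bare fragment in <html><body>...</body></html>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }
    foreach ($body->childNodes as $node) {
        if ($node instanceof DOMElement) {
            return $node->hasAttribute('class') ? $node->getAttribute('class') : null;
        }
    }
    return null;
}

var_dump(firstElementClass('<h3 class="class-name">Blah blah</h3>'));
var_dump(firstElementClass('<h3>Blah blah</h3>'));
```

The first call should yield the string class-name and the second null, with no dependence on attribute order or quoting style.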

Addendum:

To illustrate why using a regex to parse HTML is not a good idea, consider the following characteristics of valid HTML markup:

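  • Attribute values may be double quoted and may contain single quotes and angle brackets, e.g. <b title="That's <i>entertainment!</i>">bold stuff</b>. Here the title attribute value contains text that looks like tags but is not.
  • Attribute values may be single quoted and may contain double quotes and angle brackets, e.g. <b title='<i class="this is not a class">is this inside an I element? NO!</i>'>bold stuff</b>
  • Attribute values may be unquoted, e.g. <b class=myclass>bold stuff</b>
  • Attribute values are optional, e.g. <option selected>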
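A quick sketch shows how the cases listed above trip up simple patterns. The two "naive" regexes here are illustrative assumptions, not patterns taken from the question.

```php
<?php
// Two deliberately naive patterns (assumed for illustration).
$naiveTag   = '/^<[^>]+>/';                       // "a tag ends at the first >"
$naiveClass = '/class=[\'"]([^\'"]+)[\'"]/';      // "class values are always quoted"

// 1. A ">" inside a quoted attribute value cuts the tag short.
preg_match($naiveTag, '<b title="That\'s <i>entertainment!</i>">bold stuff</b>', $m1);
echo $m1[0], "\n";   // prints: <b title="That's <i>

// 2. Text inside an attribute value is mistaken for a real class attribute.
preg_match($naiveClass, '<b title=\'<i class="this is not a class">NO!</i>\'>bold stuff</b>', $m2);
echo $m2[1], "\n";   // prints: this is not a class

// 3. An unquoted class value is missed entirely.
$found3 = preg_match($naiveClass, '<b class=myclass>bold stuff</b>', $m3);
var_dump($found3);   // int(0)
```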

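Many other aspects of HTML markup can break a regex (see the caveats below), but let's start by considering the ones above.

A similar (but slightly more complex) problem:

Let's say that in addition to the CLASS attribute value, you also want to capture the element name and the values of the ID and TITLE attributes. All three attributes are optional. The tag name is always captured in $1, and any CLASS, ID or TITLE attribute values that are present are captured in $2, $3 and $4 respectively. These attributes may appear in any order and may be mixed with any number of other attributes, each with or without a value, and the values may be double quoted, single quoted or unquoted.

This can be done (imperfectly, see the caveats below) with a Perl/PHP/PCRE regex, but it is long and complex and requires the (?|...|...) branch reset construct. Here it is: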

<?php
// Match HTML start tags and print CLASS, ID and TITLE attributes.
//   Note that this method is not 100% reliable and can easily fail.
function printTagAttributes($text) {
    $re = '%# Match HTML start tags. Capture CLASS, ID and TITLE values.
        <                   # Opening < of start tag.
        (\w+)               # $1: Element name.
        (?:                 # Group for zero or more attributes.
          \s+               # Required whitespace before attribute.
          (?:               # Group for attribute alternatives.
            class\s*=\s*    # Match any CLASS attribute value in $2.
            (?|             # Branch reset group for $2: value.
              "([^"]*)"     # $2.1: Double quoted value or,
            | \'([^\']*)\'  # $2.2: Single quoted value or,
            | ([\w\-.:]+)   # $2.3: Non quoted value.
            )               # End branch reset group.
          | id\s*=\s*       # Match any ID attribute value in $3.
            (?|             # Branch reset group for $3: value.
              "([^"]*)"     # $3.1: Double quoted value or,
            | \'([^\']*)\'  # $3.2: Single quoted value or,
            | ([\w\-.:]+)   # $3.3: Non quoted value.
            )               # End branch reset group.
          | title\s*=\s*    # Match any TITLE attribute value in $4.
            (?|             # Branch reset group for $4: value.
              "([^"]*)"     # $4.1: Double quoted value or,
            | \'([^\']*)\'  # $4.2: Single quoted value or,
            | ([\w\-.:]+)   # $4.3: Non quoted value.
            )               # End branch reset group.
          | [\w\-.:]+       # or match any other attribute.
            (?:             # Group for optional attrib value.
              \s*=\s*       # Name and value separated by =
              (?:           # Group for attrib value alternatives.
                "[^"]*"     # Either Double quoted value,
              | \'[^\']*\'  # or single quoted value,
              | [\w\-.:]+   # or non quoted value.
              )             # End group of attrib value alts.
            )?              # Attribute value is optional.
          )                 # End group of attribute alternatives.
        )*                  # Zero or more attributes.
        \s*                 # Optional whitespace before close >
        /?                  # Match "empty elements" too.
        >                   # Closing > of start tag.
        %ix';
    $elementcount = preg_match_all($re, $text, $matches);
    if ($elementcount) {
        printf("%d HTML start tags found:\n", $elementcount);
        for ($i = 0; $i < $elementcount; ++$i) {
            printf("Tag[%d] = \"%s\"\n", $i + 1, $matches[1][$i]);
            // Print CLASS attribute from capture group $2
            if (isset($matches[2][$i]) && $matches[2][$i]) {
                printf("\tCLASS  = {%s}\n", $matches[2][$i]);
            } else {
                printf("\tTag has no CLASS attribute.\n");
            }
            // Print ID attribute from capture group $3
            if (isset($matches[3][$i]) && $matches[3][$i]) {
                printf("\tID     = {%s}\n", $matches[3][$i]);
            } else {
                printf("\tTag has no ID attribute.\n");
            }
            // Print TITLE attribute from capture group $4
            if (isset($matches[4][$i]) && $matches[4][$i]) {
                printf("\tTITLE  = {%s}\n", $matches[4][$i]);
            } else {
                printf("\tTag has no TITLE attribute.\n");
            }
        }
    } else {
        printf("No HTML start tags found.\n");
    }
}

$data = file_get_contents('testdata.html');
printTagAttributes($data);
?>
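The (?|...) branch reset construct used in the script above may be easier to follow in isolation. The sketch below, with made-up sample tags and variable names, compares it with an ordinary non-capturing group for the CLASS value alone.

```php
<?php
// Branch reset: every alternative stores its capture in the same group number.
$reset = '/class\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([\w\-.:]+))/i';
// Ordinary group: each alternative gets its own number (1, 2 and 3).
$plain = '/class\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([\w\-.:]+))/i';

$values = [];
foreach (['<p class="a b">', "<p class='a b'>", '<p class=ab>'] as $tag) {
    preg_match($reset, $tag, $m);
    $values[] = $m[1];               // always group 1, whatever the quoting
    printf("%s => {%s}\n", $tag, $m[1]);
}

// Without branch reset the unquoted value lands in group 3,
// and groups 1 and 2 are left as empty strings.
preg_match($plain, '<p class=ab>', $p);
var_dump($p[1], $p[2], $p[3]);
```

This is why the long regex can promise that CLASS, ID and TITLE always end up in $2, $3 and $4 regardless of how each value is quoted.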

Here is a valid HTML 4.01 STRICT test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <title>Test printTagAttributes()</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1 class="H1 CLASS" id="H1_ID" title="H1 TITLE">
    Test printTagAttributes()
</h1>
<h3 class="class-name">Blah blah</h3>
<p title="P1 TITLE" id="P1_ID" class="P1 CLASS" >
  Paragraph 1 has attributes in reverses order.
</p>
<p class=P2_CLASS id=P2_ID title=P2_TITLE>
  Paragraph 2 has attributes specified with unquoted values.
</p>
<!-- StackOverflow highlighter chokes on the following title -->
<p title='This title has <i>an embedded "non-tag"</i>!'>
  Paragraph 3 has a TITLE attribute value containing
  both double quotes and angle brackets. This one will
  trip up many regexes!
</p>
</body>
</html>

Here is the output when the above test file is run through the script:

#use python raw string to preserve spacing...
output=r'''
10 HTML start tags found:
Tag[1] = "html"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[2] = "head"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[3] = "title"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[4] = "meta"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[5] = "body"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[6] = "h1"
        CLASS  = {H1 CLASS}
        ID     = {H1_ID}
        TITLE  = {H1 TITLE}
Tag[7] = "h3"
        CLASS  = {class-name}
        Tag has no ID attribute.
        Tag has no TITLE attribute.
Tag[8] = "p"
        CLASS  = {P1 CLASS}
        ID     = {P1_ID}
        TITLE  = {P1 TITLE}
Tag[9] = "p"
        CLASS  = {P2_CLASS}
        ID     = {P2_ID}
        TITLE  = {P2_TITLE}
Tag[10] = "p"
        Tag has no CLASS attribute.
        Tag has no ID attribute.
        TITLE  = {This title has <i>an embedded "non-tag"</i>!}
'''

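Caveats:

This regex, which attempts to match a start tag, does not take into account the full complexity of the (non-regular) HTML markup language. There are many ways it can be tripped up: for example, CDATA sections, comments, scripts and styles can all cause trouble. Although there are cases where using a regex on HTML is appropriate, such cases are rare.

Answer 1 (score: 1)

ridgerunner has analyzed your problem correctly. To see why even the lazy version fails: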

 <h3 class="class-name">Blah blah</h3>
^<                               # matches the opening <               --> OK
  [^>]+?                         # lazy: matches only the "h"          --> OK
  (?:class=['\"]([^\"]+)['\"])?  # doesn't match here, but is optional --> OK
  [^>]*                          # matches until the end of the tag    --> OK
                       >         # matches the closing >               --> Match!

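The official solution is, of course, to use an HTML parser. But in your case you can work around the problem by extending the scope of the optional group: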

^<(?:[^>]*class=['\"]([^\"]+)['\"])?[^>]*>

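Now the (?:...) group itself tries every position for a valid match first: the [^>]* inside it backtracks until class= can match, and the group is only skipped if no position works.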
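A small check of the revised pattern, as a sketch with made-up sample inputs:

```php
<?php
// The revised pattern from this answer, as a single-quoted PHP string.
$re = '/^<(?:[^>]*class=[\'"]([^"]+)[\'"])?[^>]*>/';

preg_match($re, '<h3 class="class-name">Blah blah</h3>', $withClass);
preg_match($re, '<h3 id="x">Blah blah</h3>', $withoutClass);

echo $withClass[0], ' / ', $withClass[1], "\n";                       // <h3 class="class-name"> / class-name
echo $withoutClass[0], ' / ', $withoutClass[1] ?? '(not set)', "\n";  // <h3 id="x"> / (not set)
```

Note that this pattern looks for the literal text class= anywhere before the first >, so it would also pick up an attribute whose name merely ends in "class", such as data-class.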