Question

How can I use a regular expression to retrieve all of the HTML tag names in an HTML snippet? In case it matters, I am doing this in PHP. For example:

<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>

This should return: div, img, br, p.

Answer 1

This should work for most well-formed markup, provided you are not in a CDATA section and have not played nasty games redefining entities:

# nasty, ugly, illegible, unmaintainable: NEVER USE THIS STYLE!!!!
/<\w+(?:\s+\w+=(?:\S+|(['"])(?:(?!\1).)*?\1))*\s*\/?>/s

or more legibly, as

# broken out into related elements grouped by whitespace via /x
/ < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs

and more legibly still:

/ 
   # start of tag, with named ident
   < \w+ 
   # now with unlimited k=v pairs 
   #    where k is \w+ 
   #      and v is either \S+ or else quoted 
   (?: \s+ \w+ = (?: \S+        # either an unquoted value, 
                   | ( ['"] )   # or else first pick either quote
                     (?: 
                        (?! \1) .  # anything that isn't our quote, including brackets
                     ) * ?     # maximal should probably work here
                     \1        # till we see it again
                 ) 
   )  *    # as many k=v pairs as we can find
   \s *    # tolerate closing whitespace

   \/ ?    # XHTML style close tag
   >       # finally done
/xs
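
As a rough sketch of how the commented pattern above might actually be used to pull out tag names, here is a Python adaptation (Python because it appears elsewhere on this page; the same idea should carry over to PHP's preg_match_all). The named groups `name` and `q`, the helper `tag_names`, trying the quoted alternative first, and using `[^\s>]+` instead of `\S+` for unquoted values are all assumptions of this sketch, not part of the pattern above.

```python
import re

# Verbose-mode port of the commented pattern, with the tag name captured.
TAG_RX = re.compile(r"""
    < (?P<name> \w+ )              # start of tag; capture the tag name
    (?: \s+ \w+ =                  # unlimited k=v pairs, where k is \w+
        (?: (?P<q> ['"] )          # v is either quoted: pick a quote,
            (?: (?! (?P=q) ) . )*? #   anything that isn't our quote,
            (?P=q)                 #   till we see it again
          | [^\s>]+                # or else an unquoted chunk
        )
    )*
    \s*                            # tolerate closing whitespace
    /?                             # XHTML style close tag
    >                              # finally done
""", re.VERBOSE | re.DOTALL)


def tag_names(html):
    """Return opening-tag names in first-seen order, without duplicates."""
    seen = []
    for match in TAG_RX.finditer(html):
        name = match.group("name")
        if name not in seen:
            seen.append(name)
    return seen


snippet = '''<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>'''

print(tag_names(snippet))
# ['div', 'img', 'br', 'p']
```

Closing tags such as `</div>` are skipped because `\w+` cannot match the slash that follows `<`.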

You could add a few refinements, such as tolerating whitespace in a few places where the pattern above does not.

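PHP is not necessarily the best language for this kind of work, although you can make do with it in a pinch. At the very least, you should hide this stuff away in a function and/or a variable somewhere rather than leaving it exposed in all its nakedness, considering The Children Are Watching™.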

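For anything more complicated than finding, oh I don't know, letters or whitespace, patterns benefit greatly from comments and whitespace. This should go without saying, but for some reason people forget to use /x for cognitive chunking, letting whitespace group related things just as you would in imperative code.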

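Even though they are declarative programs rather than imperative ones, patterns also benefit from full problem decomposition and top-down design. One way to achieve this is with "regex subroutines" that you declare separately from the place where you use them. Otherwise you are just doing cut-and-paste code reuse, which is the duplicative kind of code reuse. Here is an example pattern for matching an <img> tag, this time using real Perl: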

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;
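
Python's built-in re module has no (?(DEFINE)...) blocks or (?&name) calls, so the following is only a loose approximation of the same decomposition: each "subroutine" becomes a named string fragment, and the fragments are assembled into one pattern. The fragment names mirror the Perl groups above, but `ATTR_NAME` (a catch-all that also admits `xml:lang`) and the simplified `END_TAG` are assumptions of this sketch, and the per-category attribute lists are left out.

```python
import re

# Leaf "subroutines", declared once and reused below.
MIGHT_WHITE = r"\s*"
ATTR_NAME = r"[\w:]+"              # assumption: any word-like name, plus ':'
QUOTED_VALUE = r"""(?P<quote>["'])(?:(?!(?P=quote)).)*(?P=quote)"""
UNQUOTED_VALUE = r"[^\s>]+"        # non-white chunk that stops at '>'

# Composite "subroutines", built from the leaves.
ONE_ATTRIBUTE = (
    rf"{ATTR_NAME}{MIGHT_WHITE}={MIGHT_WHITE}"
    rf"(?:{QUOTED_VALUE}|{UNQUOTED_VALUE})"
)
ATTRIBUTES = rf"(?:{MIGHT_WHITE}{ONE_ATTRIBUTE})*"
START_TAG = rf"<{MIGHT_WHITE}img\b"
END_TAG = r"/?>"                   # HTML '>' or XHTML '/>'

# Top-level pattern: the whole tag is saved in the TAG group.
IMG_RX = re.compile(
    rf"(?P<TAG>{START_TAG}{ATTRIBUTES}{MIGHT_WHITE}{END_TAG})",
    re.IGNORECASE | re.DOTALL,
)

html = '<p><IMG src="a.png" alt=\'x > y\' width=10 /></p><img src=b.gif>'
img_tags = [m.group("TAG") for m in IMG_RX.finditer(html)]
for tag in img_tags:
    print(tag)
```

Unlike a true DEFINE block, this is plain string interpolation, so a named group such as `quote` may appear only once in the assembled pattern, and recursion is not possible.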

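Yes, it gets long, but by getting longer it becomes more maintainable, not less. It is also more correct. Now, the real program it is used in does more than just this, because real HTML forces you to cope with a great deal more, such as CDATA and naughty redefinitions of encoded entities. However, contrary to popular belief, you can actually do this sort of thing in PHP, because it uses PCRE, which allows (?(DEFINE)...) blocks and recursive patterns. I have more serious examples in my answers here, here, here, here, and here.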

OK, OK, have you read all of those, or at least glanced at them? Still with me? Hello?? Don't forget to breathe. There, you're all right now. :)

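Certainly there is a big gray area where the possible gives way to the inadvisable, and well beyond that to the impossible. If the examples in those answers, let alone the ones in this answer, are beyond your current skill level at pattern matching, then you should probably use something else, which often means getting someone else to do it for you.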

Answer 2

I think this should work... I will try it out in a minute:

Edit: removed \s+ (thanks, Peteris)

preg_match_all('/<(\w+)[^>]*>/', $html, $matched_elements);

Answer 3

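Regular expressions may not always work. If you are 100% sure that the input is well-formed XHTML, a regex may be one way to go. If not, use some kind of PHP library to do it. In C#, there is something called the HTML Agility Pack, http://htmlagilitypack.codeplex.com; see for example How do I parse HTML using regular expressions in C#?. Perhaps there is an equivalent tool for PHP.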
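
To illustrate the parser-based route, here is a minimal sketch using Python's standard-library html.parser rather than a PHP library (a swapped-in tool, chosen only because Python appears elsewhere on this page). The class name `TagNameCollector` and the helper `parsed_tag_names` are made up for this example. It also shows one input where a simple regex reports a tag that sits inside a comment.

```python
import re
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Collect opening-tag names in first-seen order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br />, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag not in self.names:
            self.names.append(tag)


def parsed_tag_names(html):
    collector = TagNameCollector()
    collector.feed(html)
    collector.close()
    return collector.names


tricky = '<!-- <b>not a tag</b> --><I>x</I>'

print(parsed_tag_names(tricky))
# ['i']
print(re.findall(r'<(\w+)[^>]*>', tricky))
# ['b', 'I']
```

The parser skips the commented-out `<b>` and lowercases tag names, while the regex has no notion of comments.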

Answer 4

In Python, one solution is to use a regular expression to get all of the distinct tag names in the HTML.

import re

s = """<div id="someid">
       <img src="someurl" />
       <br />
       <p>some content</p>
       </div>
    """

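# Raw strings keep \w from being treated as a string escape.
print(set(re.findall(r'<(\w+)', s)))
# {'p', 'img', 'div', 'br'}  (set order may vary)

# or
print({i.replace('<', '') for i in re.findall(r'(<\w+)', s)})
# {'p', 'img', 'div', 'br'}  (set order may vary)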
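
A set discards document order, whereas the question lists the names as div, img, br, p. A small variation, assuming first-seen order is what is wanted, is to deduplicate through dict keys instead; the function name `ordered_tag_names` is made up for this sketch.

```python
import re


def ordered_tag_names(html):
    # dict keys keep insertion order (Python 3.7+), so this drops
    # duplicates while preserving the order of first appearance.
    return list(dict.fromkeys(re.findall(r'<(\w+)', html)))


html = """<div id="someid">
       <img src="someurl" />
       <br />
       <p>some content</p>
       </div>
    """

print(ordered_tag_names(html))
# ['div', 'img', 'br', 'p']
```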

Regex - matching only the tag names in HTML
