Can anyone explain this regular expression?

Time: 2012-10-04 07:31:17

Tags: regex html-parsing

I just need someone to correct my understanding of this regex, which is meant as a stopgap for matching HTML tags.

< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >

My understanding:

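  • < - matches the tag's opening symbol
  • (?: - I can't understand what is going on here. What do these symbols mean?
  • "[^"]*" ['"]* - an arbitrary string in double quotes. Is there something else going on?
  • '[^']*'['"]* - some string in single quotes
  • [^'">] - any character other than ' " >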

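So it is a '<' symbol, followed by a string in double quotes, a string in single quotes, or any other string that does not contain ' " or >, repeated one or more times, followed by a '>'.
That is the best I could make of it.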

3 Answers:

Answer 0 (score: 5)

<       # literally just an opening angle bracket, followed by a space
(       # the bracket opens a subpattern, it's necessary as a boundary for
        # the | later on
?:      # makes the just opened subpattern non-capturing (so you can't access it
        # as a separate match later)
"       # literally "
[^"]    # any character but " (this is called a character class)
*       # arbitrarily many of those (as much as possible)
"       # literally "
['"]    # either ' or "
*       # arbitrarily many of those (and possible alternating! it doesn't have
        # to be the same character for the whole string)
|       # OR
'       # literally '
[^']    # any character but ' (this is called a character class)
*       # arbitrarily many of those (as much as possible)
'       # literally '
['"]*   # as above
|       # OR
[^'">]  # any character but ', ", >
)       # closes the subpattern
+       # arbitrarily many repetitions but at least once
>       # closing tag

Note that all spaces in the regex are treated just like any other character: each one matches exactly one space.
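To illustrate the point about spaces, here is a minimal Python sketch. It assumes Python's `re` module, which the question does not mention, and uses `re.VERBOSE` as one way to make the engine ignore the spaces instead of matching them. The names `PATTERN`, `literal` and `free_spacing` are only illustrative.

```python
import re

# The pattern exactly as posted in the question, spaces included.
PATTERN = r'''< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >'''

literal = re.compile(PATTERN)                   # every space must match a space
free_spacing = re.compile(PATTERN, re.VERBOSE)  # spaces in the pattern are ignored

tag = '<a href="x>y">'

print(literal.fullmatch(tag))       # None: the input has no space after "<"
print(free_spacing.fullmatch(tag))  # matches the whole tag
print(literal.fullmatch('< a >'))   # matches: the literal spaces are present
```

So, taken literally, the pattern only matches input that really contains those spaces, which is probably not what was intended for ordinary HTML tags.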

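Also pay special attention to the ^ at the beginning of a character class. It is not treated as a separate character; instead it inverts the whole character class.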
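A short Python sketch of the three roles a ^ can play (Python is an assumption here; the behaviour shown is common to most regex flavours):

```python
import re

# A leading ^ inside brackets negates the class: "anything but a double quote".
print(re.findall(r'[^"]', 'a"b'))    # ['a', 'b']

# Anywhere else inside the brackets, ^ is an ordinary character.
print(re.findall(r'["^]', 'a"^b'))   # ['"', '^']

# Outside brackets, ^ anchors the match at the start of the string.
print(re.findall(r'^a', 'aa'))       # ['a']
```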

I should also (obligatorily) add that regular expressions are not appropriate to parse HTML.
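One concrete way this shows up, sketched in Python. The pattern below is the question's regex with its literal spaces removed, which is an assumption about what was intended; `TAG` is just an illustrative name.

```python
import re

# Assumption: the question's pattern with the literal spaces taken out.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# A ">" inside a quoted attribute value is handled as intended...
print(TAG.findall('<a title="1 > 0">link</a>'))
# ['<a title="1 > 0">', '</a>']

# ...but a ">" inside a comment ends the "tag" too early.
print(TAG.findall('<!-- a > b --><p>'))
# ['<!-- a >', '<p>']
```

The regex knows about quoted strings but nothing about comments, so the comment is cut off at its first >.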

Answer 1 (score: 2)

Split it up at the | characters, each of which denotes an "or":

<
  (?:
    "[^"]*" ['"]* |
    '[^']*'['"]* |
    [^'">]
  )+
>

(?: denotes a non-capturing group. The inside of that group matches these things (in this order):

  1. "stuff"
  2. 'stuff'
  3. asd= (any other character except ', " and >)

In effect, this is a regex that attempts to match HTML tags with attributes.
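A rough Python sketch of both points, assuming the literal spaces are dropped from the pattern; `INNER`, `tag_re` and `pieces` are illustrative names, not part of the question.

```python
import re

# The three alternatives from inside the group (spaces removed, an assumption).
INNER = r'''"[^"]*"['"]*|'[^']*'['"]*|[^'">]'''

# (?: ... ) groups the alternatives without creating a capture group.
tag_re = re.compile('<(?:' + INNER + ')+>')
print(tag_re.groups)                                  # 0
print(re.compile('<(' + INNER + ')+>').groups)        # 1, for comparison

# Applying the alternatives one step at a time shows what each repetition
# of the group consumes: a whole quoted string, or one other character.
pieces = re.findall(INNER, '''p a="x" b='y' ''')
print(pieces)
# ['p', ' ', 'a', '=', '"x"', ' ', 'b', '=', "'y'", ' ']
```

Quoted strings are swallowed whole by the first two alternatives, which is why a > inside quotes does not end the tag.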

Answer 2 (score: 0)

Here is the result of YAPE::Regex::Explain:

(?-imsx:< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <                        '< '
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
     "                       ' "'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '" '
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
     '                       ' \''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^'">]                   any character except: ''', '"', '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
   >                       ' >'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------