Question

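Disclosure: I have read this answer many times here, and I know better than to parse HTML with regex. This question is only meant to broaden my knowledge of regex.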

Say I have this string:

some text <tag link="fo>o"> other text

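I want to match the whole tag, but if I use <[^>]+> it only matches <tag link="fo>.

How can I make sure that a > inside quotes is ignored?

I could write a parser with a while loop to do this, but I want to know how to do it with regex.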

Answer 1

Regular expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Complete explanation:

I know this regex looks like a headache, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags

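To see the difference on the string from the question, here is a minimal sketch in JavaScript. The helper name firstMatch and the variable names are illustrative and not part of the answer; only the two patterns come from the thread.

```javascript
// Naive pattern from the question: stops at the first '>'.
const naivePattern = /<[^>]+>/;
// Pattern from this answer: a quoted run may contain '>'.
const quoteAwarePattern = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

const sample = 'some text <tag link="fo>o"> other text';

// Illustrative helper: first match as a string, or null if nothing matches.
const firstMatch = (pattern, text) => {
  const result = text.match(pattern);
  return result ? result[0] : null;
};

const naiveResult = firstMatch(naivePattern, sample);
const quoteAwareResult = firstMatch(quoteAwarePattern, sample);

console.log(naiveResult);      // <tag link="fo>
console.log(quoteAwareResult); // <tag link="fo>o">
```
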
Answer 2

A slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately and does not use the *? quantifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any non-single, non-double quote or greater than
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any non-single, non-double quote or greater than
    )*               #   zero or more of outer group
>                    # end of HTML tag

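This version is slightly better than Vasili's because single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.

It is slightly worse than Vasili's solution because the groups are capturing. If you do not want that, replace ( with (?: everywhere. (Using plain ( just makes the regex shorter and a little more readable.)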
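
The behavior described above can be checked with a short sketch. The sample strings and variable names are my own assumptions for illustration; nonCapturing is the same pattern with every ( written as (?:.

```javascript
// Pattern from this answer, as posted (two capturing groups).
const capturing = /<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>/;
// Same pattern with non-capturing groups.
const nonCapturing = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

// A '>' inside double quotes that also contain a single quote.
const mixedQuotes = 'text <a title="it\'s > ok"> more';
// An unbalanced quote: this is not a well-formed tag.
const unbalanced = "<a href='>";

const mixedResult = mixedQuotes.match(capturing);
const mixedTag = mixedResult ? mixedResult[0] : null;
const unbalancedMatches = capturing.test(unbalanced);

// Without the g flag, match() returns the full match plus one slot per group.
const capturingSlots = mixedQuotes.match(capturing).length;
const nonCapturingSlots = mixedQuotes.match(nonCapturing).length;

console.log(mixedTag);          // <a title="it's > ok">
console.log(unbalancedMatches); // false
console.log(capturingSlots, nonCapturingSlots); // 3 1
```
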

Answer 3

(<.+?>[^<]+>)|(<.+?>)

You can write two regexes and then join them with '|', in this case:

(<.+?>[^<]+>)   #will match  some text <tag link="fo>o"> other text
(<.+?>)         #will match  some text <tag link="foo"> other text

If the first case matches, the second regex is not tried, so make sure to put the special case first.
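
A small sketch of how the alternation picks a branch on the two sample strings. The function describeMatch and the labels 'special' and 'plain' are hypothetical names used only for this illustration.

```javascript
// Combined pattern from this answer: special case first, plain case second.
const combined = /(<.+?>[^<]+>)|(<.+?>)/;

// Hypothetical helper: report the matched text and which branch produced it.
function describeMatch(text) {
  const result = text.match(combined);
  if (!result) return null;
  // Group 1 is set only when the first (special-case) branch matched.
  const branch = result[1] !== undefined ? 'special' : 'plain';
  return { tag: result[0], branch };
}

const withGt = describeMatch('some text <tag link="fo>o"> other text');
const withoutGt = describeMatch('some text <tag link="foo"> other text');

console.log(withGt.tag, withGt.branch);       // <tag link="fo>o"> special
console.log(withoutGt.tag, withoutGt.branch); // <tag link="foo"> plain
```
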

Answer 4

If you want escaped double quotes to be handled as well, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
const nextGtMatch = () => ((exec) => {
    return exec ? exec.index : -1;
})(gtExp.exec(xml));

If you are parsing a bunch of XML, you will want to set .lastIndex.

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
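
Putting the pieces together, a self-contained sketch might look like the following. The helper findTagEnd and the sample strings are assumptions for illustration. Note that the lookahead counts unescaped double quotes all the way to the end of the input, so this approach presumably relies on the quotes in the rest of the string being balanced.

```javascript
// Matches a '>' that is followed by an even number of unescaped double quotes.
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first '>' outside quotes at or after
// `from`, or -1 if there is none.
function findTagEnd(xml, from) {
  gtExp.lastIndex = from; // start scanning here (requires the g flag)
  const result = gtExp.exec(xml);
  return result ? result.index : -1;
}

const xml = 'some text <tag link="fo>o"> other text';
const tagStart = xml.indexOf('<');
const tag = xml.slice(tagStart, findTagEnd(xml, tagStart) + 1);

// The same idea with escaped quotes inside the attribute value.
const escapedXml = '<tag title="a \\"q\\" > b"> tail';
const escapedTag = escapedXml.slice(0, findTagEnd(escapedXml, 0) + 1);

console.log(tag);        // <tag link="fo>o">
console.log(escapedTag); // <tag title="a \"q\" > b">
```
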

RegEx: do not match a character if it is inside quotes
