Question

Regex:

<span style='.+?'>TheTextToFind</span>

HTML:

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span></span>

Why does the match include this part?

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED

Answer 1

The regex engine always finds the leftmost match. That is why you get

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span>

as the match (basically the whole input, minus the final </span>).
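The leftmost-match behaviour can be checked with a short script. This is only a sketch using Python's re module, which is my assumption since the question does not say which regex flavour is in use; the variable names are illustrative.

```python
import re

# The input and the pattern from the question.
html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")
lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")

m = lazy.search(html)

# The match starts at the first position where any match is possible,
# which is the outer <span>, not the inner one.
print(m.start())
print(m.group(0))

# Whatever the match did not consume: the final closing tag.
print(html[m.end():])
```

The lazy quantifier only affects how far the match extends once a start position has been chosen; it does not make the engine prefer a later start position.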

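To steer the engine in the right direction, and assuming that > never appears literally inside an attribute value, the following regex will match what you want: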

<span style='[^>]+'>TheTextToFind</span>

This regex does what you want because, under the assumption above, [^>]+ cannot match beyond the end of the tag.
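A quick check of the corrected pattern against the same input, again sketched in Python (the choice of language is an assumption on my part, not something stated in the question):

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# [^>]+ cannot cross the > that ends the opening tag, so the attempt that
# starts at the outer <span> fails and the engine moves on to the inner one.
fixed = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")

m = fixed.search(html)
print(m.group(0))
```

This still relies on the assumption that no > appears inside the style attribute; an input that violates it would need a different pattern or, better, a parser.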

However, I hope you are not doing this as part of a program that extracts information from HTML pages. Use an HTML parser for that purpose.
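As one possible illustration of the parser route, here is a minimal sketch built on Python's standard html.parser module. The class name SpanFinder and its attributes are hypothetical, chosen for this example only, and the sketch assumes the whole document is fed in a single call so that each text run arrives in one handle_data call.

```python
from html.parser import HTMLParser


class SpanFinder(HTMLParser):
    """Collect the style of every <span> whose own text equals `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # style values of the <span> tags currently open
        self.found = []   # styles of the spans that directly hold the target

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append(dict(attrs).get("style"))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # The innermost open <span> is the direct parent of this text.
        if self.stack and data.strip() == self.target:
            self.found.append(self.stack[-1])


html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

finder = SpanFinder("TheTextToFind")
finder.feed(html)
finder.close()
print(finder.found)
```

Because the parser tracks nesting, the outer span is never confused with the inner one, whatever characters appear in the attributes or the text.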

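To understand why your regex matches the way it does, you need to know that the engine backtracks into .+?, extending it one character at a time, until the sequel ('>TheTextToFind</span>) can match: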

# Matching .+?
# Since +? is lazy, it matches . once (to fulfill the minimum repetition), and
# increases the number of repetitions each time the sequel fails to match
<span style='f                        # FAIL. Can't match closing '
<span style='fo                       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;        # PROCEED. But FAIL later, since can't match T in The
<span style='font-size:11.0pt;'       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;'>DON   # PROCEED (' matches). But FAIL later, since > can't match T
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style=
                                      # PROCEED (' matches). But FAIL later, since > can't match f
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
                                      # PROCEED. MATCH FOUND.

As you can see, .+? keeps increasing its length until it has matched font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;, at which point the sequel '>TheTextToFind</span> can finally match.
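To see exactly what .+? ends up consuming, you can wrap it in a capturing group. The sketch below uses Python (an assumption, as the question names no language) and also includes a small hand-rolled loop, lazy_expand, that imitates the trace above; that helper is purely illustrative and is a simplified model, not how a real regex engine is implemented.

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# Same pattern as in the question, with .+? wrapped in a group.
m = re.search(r"<span style='(.+?)'>TheTextToFind</span>", html)
print(m.group(1))


def lazy_expand(text, prefix, sequel):
    """Imitate prefix + '.+?' + sequel anchored at the start of `text`.

    Tries lengths 1, 2, 3, ... for the wildcard and stops at the first
    length where the literal sequel follows. Returns the wildcard's text,
    or None if no length works.
    """
    if not text.startswith(prefix):
        return None
    rest = text[len(prefix):]
    for n in range(1, len(rest) + 1):
        if rest.startswith(sequel, n):
            return rest[:n]
    return None


consumed = lazy_expand(html, "<span style='", "'>TheTextToFind</span>")
print(consumed == m.group(1))
```

The loop works as a model here only because the sequel is a plain literal and the input contains no newlines, which . would not match by default.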
