Question

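I'm trying to match only sequential occurrences of a particular tag in an HTML fragment. For the test string "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..", I want to match only "Time", "Warner" and "Satan" (as separate strings or as a group, it doesn't matter), but not "BAD".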

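My closest attempt so far is ((?P<match>.*?)[\s\.]){2,}, which gives me 'Satan'. It at least seems to enforce the "two or more" requirement, but it doesn't return everything in that match. I'm guessing that what I need is a solution involving positive lookahead, but I can't seem to get one working.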

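I've looked at various other related questions but couldn't find a suitable solution. Most of them are simply filled with answers stating that HTML should never be parsed with regular expressions, rather than answering the question. I'd be happy with an lxml / BeautifulSoup solution, as long as it enforces the sequential aspect of my requirement, but I'm most interested in a regex, even if only out of curiosity. I'm sure that what I'm looking for must be possible with a regex.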

Thanks for your help and advice.

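Edit: I've realised I can solve this with a much simpler approach: match every instance of the tag with (?P<match>.*?), iterate over the match objects, and compare the start and end positions of each match. It works, but I'd rather find a neater solution.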
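
A minimal sketch of that position-comparing approach might look like the following. It assumes the tag in question is <em> and that "sequential" means nothing but whitespace sits between one closing tag and the next opening tag; the names TAG and sequential_runs are made up for illustration.

```python
import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

# Assumption: the tag of interest is <em> with no attributes.
TAG = re.compile(r"<em>(?P<match>.*?)</em>")


def sequential_runs(text, min_len=2):
    """Group tag bodies whose matches are separated only by whitespace."""
    runs, current, prev_end = [], [], None
    for m in TAG.finditer(text):
        # Adjacent if only whitespace lies between the previous match's end
        # and this match's start.
        if prev_end is not None and not text[prev_end:m.start()].strip():
            current.append(m.group("match"))
        else:
            # The run is broken: keep the old one if it is long enough.
            if len(current) >= min_len:
                runs.append(current)
            current = [m.group("match")]
        prev_end = m.end()
    if len(current) >= min_len:
        runs.append(current)
    return runs


print(sequential_runs(html))
```

Each inner list is one run of adjacent tags, so an isolated tag such as the one around "BAD" never appears in the result.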

Answer 1

If you're curious about an re solution, it could look like this:

import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

rx = r"""(?x)          # extended mode - enable comments
    (                  # match a tag
        <em            # tag name
          [^<>]*       # maybe also attributes
        >              # open tag matched
        (              # now match the tag body
            (?<!</em)  # there must be no closing tag before a character
            .          # a body character
        ) *            # some more characters like this
        </em>          # closing tag
        \s*            # maybe some spaces after it
    ){2,}              # repeat the whole thing twice or more
"""

# Wrap every run of two or more adjacent tags in {{ }} to make it visible
print(re.sub(rx, r'{{\g<0>}}', html))
# blah <em>BAD</em> blah blah blah {{<em>Time</em> <em>Warner</em> <em>Satan</em>}}. The blah ..
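
If the individual words are wanted rather than the whole run, one option is a second pass over each matched run. A repeated capturing group in Python's re keeps only its last iteration, which is probably why the attempt in the question returned just 'Satan'. The sketch below is only one way to do it: it uses a negative lookahead instead of the lookbehind above to keep a tag body from running past its closing tag, and the names TAG, RUN and em_runs are made up for illustration.

```python
import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

# One <em> tag; group 1 is its body, which may not contain "</em>".
TAG = r"<em[^<>]*>((?:(?!</em>).)*)</em>"
# Two or more tags separated by nothing but optional whitespace.
RUN = re.compile("(?:" + TAG + r"\s*){2,}")


def em_runs(text):
    """Return one list of tag bodies per run of adjacent <em> tags."""
    # First pass finds each run; second pass pulls out every body in it.
    return [re.findall(TAG, run.group(0)) for run in RUN.finditer(text)]


print(em_runs(html))
```

Like the pattern above, this assumes the <em> tags are not nested and that a tag body contains no newlines.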

Python regex to match sequential HTML tags