Question

我想使用compile运行子程序。我有子逻辑工作（参见下面的第一个例子），所以函数打印＆＃34;，花生，＆＃34;如预期的。基本上，sub会强调任何长度的html标签，有或没有属性，并替换为＆＃34;，＆＃34;。但是，我想让第二个版本正常工作，虽然更容易修改，因为要添加一个新的强调标记，我添加＆＃34;标记名＆＃34;而不是＆＃34; |标记名| /标记名＆＃34;分别用于打开和关闭版本。我知道答案是以某种方式使用编译。我搜索过，找不到答案。

使用：

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    str1 = re.sub(r'<(b|\/b|i|\/i|ul|\/ul)[^>]*>', ', ', str1, flags=re.I)
    print str1

不能工作：

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = '%s|%s|/%s' % (str2, x, x)
    str2 = "r'<(%s)[^>]*>'" % (str2, )
    str1 = re.sub(re.compile(str2, re.IGNORECASE), ', ', str1, flags=re.I)
    print str1

Answer 1

首先，所有非平凡的正则表达式都应该以自由间隔模式编写，并带有适当的缩进和大量注释。这样做可以让您轻松查看和编辑要剥离的标记列表 - （即不需要将标记放在固定的常量列表中 - 就像在正则表达式中添加一行一样容易）。 / p>

import re
re_strip_tags = re.compile(r"""
    # Match open or close tag from list of tag names.
    <                # Tag opening "<" delimiter.
    /?               # Match close tags too (which begin with "</").
    (?:              # Group list of tag names.
      b              # Bold tags.
    | i              # Italic tags.
    | em             # Emphasis tags.
#   | othertag       # Add additional tags here...
    )                # End list of tag names.
    (?:              # Non-capture group for optional attribute(s).
      \s+            # Attributes must be separated by whitespace.
      [\w\-.:]+      # Attribute name is required for attr=value pair.
      (?:            # Non-capture group for optional attribute value.
        \s*=\s*      # Name and value separated by "=" and optional ws.
        (?:          # Non-capture group for attrib value alternatives.
          "[^"]*"    # Double quoted string (Note: may contain "&<>").
        | '[^']*'    # Single quoted string (Note: may contain "&<>").
        | [\w\-.:]+  # Non-quoted attrib value can be A-Z0-9-._:
        )            # End of attribute value
      )?             # Attribute value is optional.
    )*               # Zero or more tag attributes.
    \s* /?           # Optional whitespace and "/" before ">".
    >                # Tag closing ">" delimiter.
    """, re.VERBOSE | re.IGNORECASE)

def cut_out_emphasis(str):
    return re.sub(re_strip_tags, ', ', str)

print (cut_out_emphasis("<b class='boldclass'>peanut</b>"))

编写Python正则表达式时，为了完全避免转义/反斜杠的任何问题，请始终使用：r"raw string"或r"""raw multi-line string"""语法。请注意，在上面的脚本中，正则表达式只编译一次，但可以多次使用。（但是，还应该注意，这不是一个优势，因为Python内部缓存编译的正则表达式。）

还应该提到的是，使用正则表达式解析HTML是，让我们说; 通常不赞成。

Answer 2

编译的RE并不像那样工作。你应该这样做：

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    re_compiled = re.compile(str2, re.IGNORECASE)
    str1 = re_compiled.sub(', ', str1)
    return str1

但是它可以编译正则表达式。如果多次使用相同的正则表达式，它可以提高性能。在你的情况下，你可以坚持这一点：

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    str1 = re.sub(str2, ', ', str1, flags=re.I)
    return str1

python正则表达式用标志编译

2 个答案: