Question

I have the following regular expression in my Python code, and it is really long. Since Python is a whitespace-sensitive language, how can I clean it up?

matches = re.findall("((?:jan(?:(?:.)?|(?:uary)?)|feb(?:(?:.)?|(?:ruary)?)|mar(?:(?:.)?|(?:ch)?)|apr(?:(?:.)?|(?:il)?)|may|jun(?:(?:.)?|(?:e)?)|jul(?:(?:.)?|(?:y)?)|aug(?:(?:.)?|(?:gust)?)|sep(?:(?:.)?|(?:ept(?:(?:.)?))?|(?:tember)?)|oct(?:(?:.)?|(?:ober)?)|nov(?:(?:.)?|(?:ember)?)|dec(?:(?:.)?|(?:ember)?)) (?:[12][0-9]|[1-9]))",fileText,re.IGNORECASE)

Any help is much appreciated.

Answer 1

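You can use the re.VERBOSE flag to split the regular expression across multiple lines.

Note that to use more than one flag, you must combine them with the bitwise OR operator: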

flags = re.IGNORECASE | re.VERBOSE
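
As a small illustration of combining the two flags, here is a sketch with a deliberately shortened pattern. The month list and the sample string are made up for the example and are not the asker's full expression.

```python
import re

# Combine flags with bitwise OR: re.VERBOSE lets the pattern span lines
# and carry comments, re.IGNORECASE makes "JAN" match as well as "jan".
flags = re.IGNORECASE | re.VERBOSE

pattern = re.compile(r"""
    (?: jan | feb | mar )   # abbreviated month, shortened list for illustration
    \s                      # whitespace must be written explicitly in verbose mode
    [1-9]                   # single-digit day
""", flags)

matches = pattern.findall("JAN 5, feb 7, xyz 9")
print(matches)  # ['JAN 5', 'feb 7']
```

Because verbose mode discards literal whitespace in the pattern, the separator between month and day is written as \s rather than a plain space.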

Answer 2

I prefer to write complex regular expressions like this:

r"""(?x)
    ....
"""

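where

r starts a raw string literal, so backslashes only need to be escaped once
""" starts a multi-line string literal
(?x) turns on extended (verbose) mode: whitespace is ignored and comments are allowed

For your example: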

date = r"""(?xi)

    (?:  # this is a comment
          jan (?: \.|uary)?
        | feb (?: \.|ruary)?
        | mar (?: \.|ch)?
        | apr (?: \.|il)?

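        # etc. (the remaining months follow the same pattern)
    )
    \s  # verbose mode ignores literal spaces, so match the separator explicitly
    (?: # well, how about 30, 31?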
        [12][0-9] | [1-9]
    )

"""

I find inline flags such as (?xi) more readable than the re.XXX constants, because they are tied to the expression itself and travel with it.
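
For reference, here is one way the complete pattern might look when written in this style. It is a sketch under assumptions: the months not spelled out in the answer are filled in by analogy, the "sept" variant and the 3[01] alternative are additions, and the names sample and found are hypothetical.

```python
import re

# Inline flags: x = verbose, i = ignore case. They must come first in the pattern.
date = r"""(?xi)
    (
        (?:
              jan (?: \.|uary)?
            | feb (?: \.|ruary)?
            | mar (?: \.|ch)?
            | apr (?: \.|il)?
            | may
            | jun (?: \.|e)?
            | jul (?: \.|y)?
            | aug (?: \.|ust)?
            | sep (?: \.|t\.?|tember)?   # assumed: also accept "sept" and "sept."
            | oct (?: \.|ober)?
            | nov (?: \.|ember)?
            | dec (?: \.|ember)?
        )
        \s                               # literal spaces are ignored in verbose mode
        (?: 3[01] | [12][0-9] | [1-9] )  # two-digit alternatives before the one-digit one
    )
"""

sample = "ght july 24 tiren august 23 hyu jan. 11 and Sept. 30"
found = re.findall(date, sample)
print(found)  # ['july 24', 'august 23', 'jan. 11', 'Sept. 30']
```

Putting 3[01] and [12][0-9] ahead of [1-9] matters because alternation takes the first branch that succeeds; with [1-9] first, "30" would be cut short to "3".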

Answer 3

Is this what you want?

import re

regx = re.compile(r"("
                  r"(?:"
                  r"jan(?:\.|uary)"
                  r"|"
                  r"feb(?:\.|ruary)"
                  r"|"
                  r"mar(?:\.|ch)"
                  r"|"
                  r"apr(?:\.|il)"
                  r"|"
                  r"may"
                  r"|"
                  r"ju(?:n[.e]|l[.y])"
                  r"|"
                  r"aug(?:\.|ust)"
                  r"|"
                  r"sep(?:\.|tember)"
                  r"|"
                  r"oct(?:\.|ober)"
                  r"|"
                  r"(?:nov|dec)(?:\.|ember)"
                  r")"
                  # 3[01] goes first; placed after [1-9] it could never match "30" or "31"
                  r" (?:3[01]|[12][0-9]|[1-9])"
                  r")",
                  re.IGNORECASE)


s = "ght july 24 tiren august 23 hyu jan. 11"

print(regx.findall(s))

Result

['july 24', 'august 23', 'jan. 11']

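Inside square brackets (a character class), the dot loses its special meaning, which is why n[.e] and l[.y] need no backslash.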
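
A quick way to see the difference, using a made-up sample string:

```python
import re

text = "june jun. junk"

# Outside a character class, "." matches any character except a newline.
any_char = re.findall(r"jun.", text)

# Inside a character class, "." is just a literal dot.
literal_dot = re.findall(r"jun[.e]", text)

print(any_char)     # ['june', 'jun.', 'junk']
print(literal_dot)  # ['june', 'jun.']
```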

How to clean up a regular expression
