Regex: match any character zero or more times, except digits "touching" the matched groups

Time: 2017-12-17 14:51:02

Tags: python regex

I am having a hard time creating a regex (in Python 3.6) that parses date-time strings according to these rules:

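  • The date is always in the form YYYYMMDD, where YYYY is any year between 2000 and 2099 (inclusive), so it becomes 20yyMMDD
  • The time is always in the form HHMMSS
  • The date always comes before the time, as in YYYYMMDDHHMMSS
  • The date and time can be separated by nothing or by any non-digit character
    • YYYYMMDDHHMMSS - OK
    • YYYYMMDD-HHMMSS - OK
    • YYYYMMDD HHMMSS - OK
    • YYYYMMDD1HHMMSS - NOK
  • There can be any characters in front or at the end, except that the characters "touching" the date-time string must be non-digits
    • (YYYYMMDDHHMMSS) - OK
    • 123-YYYYMMDDHHMMSS)123 - OK
    • abc1YYYYMMDDHHMMSS - NOK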

I know the basics of regex and have read many SO answers ("Regex: match everything but", "Regex, every non-alphanumeric character except white space or colon", and others were very useful), but I cannot figure out a regex that passes all my test cases.

I need two groups to capture the actual date and time, i.e. (20[\d]{6})([\d]{6}). Then I added support for other characters, .*(20[\d]{6})[^\d]?([\d]{6}).*, which worked fine until there was a digit in front, at the end, or in the middle: in that case it should not match, but it does. So I started adding different things in front or at the back, e.g. (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge ran out quickly, and the pattern turned into a mess that I do not understand and that does not work either.

I made some test strings (each with the desired result) and a simple test function:

import datetime
import re
from typing import Tuple, List

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)

tests: List[Tuple[str, datetime.datetime]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]

for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))

But I cannot pass all the test strings, e.g.:

a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10

Thanks for any ideas.

1 answer:

Answer 0 (score: 2)

It seems you need to check the general pattern with a regex and validate the actual date-time value with the appropriate Python methods.
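To make that split concrete, here is a minimal sketch of the value-validation half on its own. The helper name `parse_digits` is hypothetical, and it assumes the 14 digits have already been extracted by some pattern check.

```python
import datetime
from typing import Optional


def parse_digits(digits: str) -> Optional[datetime.datetime]:
    """Validate a 14-digit YYYYMMDDHHMMSS string (hypothetical helper)."""
    try:
        return datetime.datetime.strptime(digits, "%Y%m%d%H%M%S")
    except ValueError:
        # The string has the right shape but an impossible value,
        # e.g. month 13 or minute 91.
        return None


print(parse_digits("20171217091011"))  # a real date-time
print(parse_digits("20171317091011"))  # month 13: right shape, invalid value
```

A regex can enforce "20 followed by 12 digits", but deciding whether those digits form a real calendar date is easier to leave to `strptime`.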

So, you may fix your code using the following regex:

r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'

See the regex demo.

Details

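  • (?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
  • 20 - the literal substring 20
  • \d{6} - any 6 digits
  • \D? - 1 or 0 non-digit characters
  • \d{6} - any 6 digits
  • (?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.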
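The pattern as written has no capturing groups, while the question's test loop calls match.groups() and uses re.match, which only tries to match at the start of the string. The following is a minimal sketch of how the pieces might fit together, assuming capturing groups are added around the date and time parts and re.search is used instead. The names `DT_RE` and `find_datetime` are hypothetical.

```python
import datetime
import re
from typing import Optional

# The pattern from above with two capturing groups added (an assumption,
# so that match.groups() yields the date part and the time part).
DT_RE = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def find_datetime(text: str) -> Optional[datetime.datetime]:
    """Return the first date-time found in text, or None (hypothetical helper)."""
    # re.search scans the whole string; re.match would only try position 0.
    match = DT_RE.search(text)
    if match is None:
        return None
    try:
        # The regex only checks the shape; strptime validates the value.
        return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        return None


# A few strings taken from the question's test list.
for sample in ["a1-20171217091011", "201712170910110", "123(20171217-091011)456"]:
    print("{: <32s} = {}".format(sample, find_datetime(sample)))
```

This also suggests why the \b attempt in the question fails on inputs such as a20171217091011 and _20171217091011: letters, digits, and the underscore are all word characters, so there is no word boundary between the prefix and the 2, whereas the lookarounds only care about digits.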