I am having a hard time creating a regular expression (in Python 3.6) that parses date-time strings according to these rules:

The date has the form YYYYMMDD, where YYYY is any year between 2000 and 2099 (inclusive), so it becomes 20yyMMDD.
The time has the form HHMMSS, so the full string is YYYYMMDDHHMMSS.

YYYYMMDDHHMMSS - OK
YYYYMMDD-HHMMSS - OK
YYYYMMDD HHMMSS - OK
YYYYMMDD1HHMMSS - not accepted
(YYYYMMDDHHMMSS) - OK
123-YYYYMMDDHHMMSS)123 - OK
abc1YYYYMMDDHHMMSS - not accepted

I know the basics of regular expressions and have read many SO answers (I found "Regex: match everything but", "Regex, every non-alphanumeric character except white space or colon" and other very useful ones), but I cannot work out a regex that passes all of my test cases.

I need two groups to parse the actual date and time, i.e. (20[\d]{6})([\d]{6}). Then I added support for other characters, .*(20[\d]{6})[^\d]?([\d]{6}).*, which worked fine until there was a digit character in front, at the end, or in the middle: in those cases it should not match, but it does. So I started adding different things at the front or the back, e.g. (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge runs out quickly, and the pattern turns into a mess that I do not understand and that does not work either.

I made some test strings (each with its desired result) and a simple test function:

import datetime
import re
from typing import List, Optional, Tuple

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)

# Each entry is (input string, expected result); None means "must not parse".
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]

for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            # Join the date and time groups and let strptime validate the values.
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))

But I cannot get it to pass all of the test strings, e.g.:

a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10

Thanks for any ideas.
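Answer 0 (score: 2)

It seems you need to check for the general pattern with a regex and, at the same time, validate the actual date-time value with the appropriate Python methods.

So you may fix the code using the following regex:

r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'

See the regex demo.

Details

(?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
20 - a literal 20 substring
\d{6} - any 6 digits
\D? - 1 or 0 non-digit characters
\d{6} - any 6 digits
(?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.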
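
As a minimal sketch of how that pattern might be plugged into the question's test loop: the pattern as written has no capture groups, and the question's loop uses re.match, which only tries the start of the string. The version below assumes two capture groups are added around the date and time parts and that re.search is used instead. The names TIMESTAMP_RE and parse_timestamp are hypothetical, introduced only for this illustration.

```python
import datetime
import re
from typing import Optional

# Assumption: same pattern as in the answer, with capture groups added
# around the date part and the time part so they can be joined later.
TIMESTAMP_RE = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def parse_timestamp(text: str) -> Optional[datetime.datetime]:
    """Return the first timestamp found in text, or None (hypothetical helper)."""
    # re.search scans the whole string; re.match would only try position 0.
    match = TIMESTAMP_RE.search(text)
    if match is None:
        return None
    try:
        # The regex only checks the shape; strptime validates the actual values.
        return datetime.datetime.strptime(
            match.group(1) + match.group(2), "%Y%m%d%H%M%S"
        )
    except ValueError:
        return None


if __name__ == "__main__":
    for sample in ("123_20171217091011", "20171217-091011", "201712170910110"):
        print(repr(sample), "->", parse_timestamp(sample))
```

Note that this sketch gives up if the first regex match is not a valid date-time; whether it should instead continue to later matches (for instance with re.finditer) is a design choice the question does not settle.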