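I have a dataset full of strings, and I want to separate out the strings that contain dates. I wrote the following regular expression to extract them: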
print (re.findall(r'[Jan(uary)?|Feb(ruary)?|Mar(ch)?||April|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?]+\s\d+', x))
where x is the string being processed. I want to capture the following formats, for example:
December 2018
Feb 11-12
Feb 12-Mar 21
3rd Jan
February 12
However, some extra strings are extracted as well, such as:
"Of 2017" from the string "BEST OF 2017"
"Line 1" from the string "Line 1"
"'addington 2" & "Paddington 2" from string "Paddington 2"
'hopping 3', 'as 20'
How can I fix these errors?
Answer 0 (score: 1)
The regular expression you are looking for is a bit more involved:
^(\d{1,2}\w{2} )?((Jan(uary)?|Feb(ruary)?|Mar(ch)?|April|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)[- \d]*)+$
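The underlying problem in the question is that square brackets define a character class, so the original pattern matches any run of the individual letters and symbols listed inside it rather than whole month names. Below is a minimal Python sketch of how the anchored pattern above might be applied to decide whether a whole string is a date. The names `MONTH`, `DATE_RE` and `is_date` are made up for illustration, and the capturing groups have been swapped for non-capturing ones, which does not change what the pattern matches.

```python
import re

# Month names as alternatives inside a non-capturing group (not a character class).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|April|May|Jun(?:e)?|Jul(?:y)?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

# Optional ordinal day prefix such as "3rd ", then one or more
# "month + digits/spaces/hyphens" chunks, anchored to the whole string.
DATE_RE = re.compile(r"^(?:\d{1,2}\w{2} )?(?:" + MONTH + r"[- \d]*)+$")


def is_date(text):
    """Return True if the whole string looks like one of the date formats."""
    return DATE_RE.match(text) is not None


samples = [
    "December 2018", "Feb 11-12", "Feb 12-Mar 21", "3rd Jan", "February 12",
    "BEST OF 2017", "Line 1", "Paddington 2",
]
for s in samples:
    print(repr(s), is_date(s))
```

Because the pattern is anchored with `^` and `$`, it classifies entire strings; it is not meant for pulling dates out of the middle of longer text.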
Answer 1 (score: 0)
Tested at https://regex101.com/ and it works as expected:
/(Jan(uary)?|Feb(ruary)?|Mar(ch)?|April|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s\d+/
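For use from Python with `re.findall`, one detail matters: when a pattern contains capturing groups, `findall` returns the groups rather than the full match. The sketch below therefore uses non-capturing groups `(?:...)` throughout. The helper name `find_dates` and the leading `\b` word boundary are my own additions, not part of the answer, and the pattern only covers the "month followed by digits" shape.

```python
import re

# Month alternatives in a non-capturing group, followed by whitespace and digits.
# The leading \b (an added assumption) keeps a match from starting mid-word.
DATE_IN_TEXT = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|April|May|Jun(?:e)?|Jul(?:y)?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s\d+"
)


def find_dates(text):
    """Return every 'month digits' substring found in text."""
    return DATE_IN_TEXT.findall(text)


for s in ["Released December 2018 worldwide", "Feb 12-Mar 21",
          "BEST OF 2017", "Paddington 2", "Line 1"]:
    print(repr(s), find_dates(s))
```

Formats where the day comes first, such as "3rd Jan", are not matched by this shape and would need an extra alternative.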