Regular expression to extract all complex date formats from a string in Python

Time: 2018-07-01 10:20:33

Tags: python regex date

I have the following string:

 dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"

I want to extract every date mentioned in this string with a regex. As a first attempt, I wrote the following regex:

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'

re.findall(regEx, dateEntries)

I expected it to work, but it returns only a subset of the dates:

A = ['Mar 20, 2009',
 'March 20, 2009',
 'Mar. 20, 2009',
 'Mar 20 2009',
 '20 Mar 2009',
 '20 March 2009',
 '2 Mar. 2009',
 '20 March, 2009',
 'Mar 20th, 2009',
 'Mar 21st, 2009',
 'Mar 22nd, 2009',
 'Feb 2009',
 'Sep 2009',
 'Oct 2010']

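I don't understand why it doesn't return these dates:

B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']

By adapting regEx I arrived at r'(?:\d{1,2}[-\s\/])?(?:\d{1,2}[-\/\s])?(?:\d{2,4})', which works well for set B. However, I could not get a single regEx that produces A + B.

Can anyone help me build a regex that extracts all the dates mentioned in my dateEntries?

Note: I want to solve this using regular expressions only.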

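4 Answers:

Answer 0 (score: 2)

You were only missing a ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to mark it as optional. Without it, the month name is mandatory, so purely numeric dates can never match. I also added a + after the last two groups to make sure the regex does not split a date such as "March 20, 2009" into two separate dates.

Full code: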

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)

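If your dates are preceded by whitespace, the results will contain that leading whitespace too. If you keep working with the date strings, you can remove it, for example with the .strip() method.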
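As a minimal sketch of that cleanup, the snippet below runs the pattern from this answer on a shortened sample (an assumption made for brevity, not the full dateEntries) and strips every match. The names sample, raw and cleaned are illustrative only.

```python
import re

# Pattern copied unchanged from the answer above.
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

# Shortened sample (assumption): a few entries taken from dateEntries.
sample = "04-20-2009; 04/20/09; Mar 20, 2009; 2009"

raw = re.findall(regEx, sample)

# Numeric entries that follow "; " are matched together with the space
# in front of them, so strip surrounding whitespace from each match.
cleaned = [d.strip() for d in raw]
print(cleaned)
```

Stripping after the match keeps the pattern itself unchanged, which is the approach this answer suggests.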

Answer 1 (score: 0)

Try this regex:

^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$
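Note that this pattern is anchored with ^ and $, so it has to match one complete string and cannot find dates inside a longer one. A minimal sketch, assuming the entries are first split on "; " (the sample string and variable names are illustrative): each piece is then tested on its own. Not every format from the question matches; for example, the day part of this pattern does not accept a trailing comma.

```python
import re

# Pattern copied unchanged from the answer above.
pattern = re.compile(
    r'^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$'
)

# Shortened sample (assumption): three entries from dateEntries.
whole = "04-20-2009; 20 March 2009; Mar 20, 2009"

# Searching the whole string fails because of the ^ and $ anchors.
print(pattern.search(whole))

# Testing each entry separately works for the formats the pattern covers.
results = {entry: bool(pattern.match(entry)) for entry in whole.split('; ')}
print(results)
```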


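Answer 2 (score: 0)

Your regex pattern is completely unreadable. Please build your regex pattern from simple building blocks; this makes the code much more readable.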

import re
import calendar

full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)

sep = r'[.,]?\s+'               # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'

r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
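The pattern above returns only the textual dates (set A from the question). Below is a minimal sketch of how the same building-block style could be extended to the numeric formats as well. The block names, the four-digit year for textual dates, and the order of the alternatives are assumptions made for this sketch, not part of the original answer.

```python
import re

dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"

# Building blocks (all groups are non-capturing so findall returns whole matches).
month = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'
day = r'\d{1,2}(?:st|nd|rd|th)?'    # optional ordinal suffix
year = r'\d{4}'

# Longer, more specific alternatives come first so they win over shorter ones.
alternatives = [
    rf'{day}\s+{month},?\s+{year}',       # 20 March, 2009
    rf'{month}\s+{day},?\s+{year}',       # Mar 20th, 2009
    rf'{month}\s+{year}',                 # Feb 2009
    r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}',     # 04-20-2009, 4/3/09
    r'\d{1,2}/\d{4}',                     # 6/2008
    year,                                 # 2009
]

pattern = re.compile('|'.join(alternatives))
found = pattern.findall(dateEntries)
print(found)
```

Because each alternative is a short named piece, a format that is missed or matched wrongly can be traced back to a single line.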

Answer 3 (score: 0)

You can try the following regex:

(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
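One caveat about this pattern (it is identical to the one in answer 0): inside square brackets, th|st|nd|rd is not an alternation. [-/th|st|nd|rd\s] is a character class that matches any single one of the listed characters, including a literal |. A small sketch of the difference, using illustrative pattern names that do not appear in the answers:

```python
import re

# Character class: any mix of the characters t, h, |, s, n, d, r is accepted.
as_class = re.compile(r'\d{1,2}[th|st|nd|rd]*')

# Non-capturing group: only the four ordinal suffixes are accepted.
as_group = re.compile(r'\d{1,2}(?:th|st|nd|rd)?')

checks = {}
for text in ('20th', '20dr', '20|'):
    checks[text] = (bool(as_class.fullmatch(text)), bool(as_group.fullmatch(text)))
    print(text, checks[text])
```

The character class still works on the question's data because the extra characters it allows never occur there, but the group form states the intent more precisely.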