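I went through several links before posting this question, so please read it through. The two answers below already solved about 90% of my problem: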
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
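Problem: I need to parse multiple dates, in multiple formats, in Python.
Solution from the links above: this works for most inputs, but there are still some formats I cannot parse.
The formats that still fail are: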
text = 'I want to visit from May 16-May 18'
text = 'I want to visit from May 16-18'
text = 'I want to visit from May 6 May 18'
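I also tried regular expressions, but because the dates can appear in any format the code became very complicated, so I ruled that option out. Please suggest how to modify the code shown in the links so that it also handles the three formats above.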
Answer 0: (score: 1)
This kind of problem always needs tweaking as new edge cases turn up, but the following approach is fairly robust:
from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re


def get_date_part(x):
    # Month names are returned unchanged; numbers lose any ordinal suffix (st, nd, rd, th).
    if x.lower() in month_list:
        return x

    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviation, e.g. August -> Aug.
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')


tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table that turns every punctuation character into a space.
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    # Sorting on isdigit() puts the month names first and the numbers last.
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = list(map(month_full, values))
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            # No year in the text: parse with cur_year, reusing the first month for extra days.
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, cur_year), '%b %d %Y') for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]

        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1

        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)

        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
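Several of the results above roll into later years, for example `15 march, 10 feb, 5 jan` ends in 2019. A minimal sketch of that roll-forward step on its own, written for Python 3, may make the behaviour easier to follow. The function name `roll_years_forward` and its parameters are hypothetical, introduced here only for illustration:

```python
from datetime import datetime


def roll_years_forward(dates, first_year):
    """Move any date that precedes its predecessor into a later year."""
    result = []
    previous = datetime(first_year, 1, 1)
    next_year = first_year + 1
    for d in dates:
        if d < previous:
            # Out of order: assume the writer meant a later year.
            d = d.replace(year=next_year)
            next_year += 1
        previous = d
        result.append(d)
    return result


raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years_forward(raw, 2017)
print([d.strftime('%d/%m/%Y') for d in rolled])
# ['15/03/2017', '10/02/2018', '05/01/2019']
```

Note that `next_year` only ever increases, so each out-of-order date lands one year after the last adjusted one. That is why the third date ends up in 2019 rather than 2018.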
It works as follows:
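First create a list of valid month names, both full and abbreviated.
Build a translation table so that any punctuation in the text can quickly be replaced with spaces.
Split the text and keep only the date parts, using a function with a regular expression to spot days or months.
Sort the list by whether each part is a digit, which groups the months at the front and the numbers at the end.
Pair each month with a day, reusing the first month when there are more days than months. Normalise each month name to its abbreviated form, e.g. August to Aug, and convert each pair into a datetime object.
If a date falls before the previous one, move it forward into the next year.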
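The first four steps can be tried on a single string to see the intermediate values. This is only a sketch for Python 3, not the answer's exact code: the names `MONTHS`, `STRIP_PUNCT`, `date_tokens` and `group_tokens` are hypothetical, and it assumes an English locale so that `calendar.month_name` yields English month names.

```python
import calendar
import re
import string
from itertools import groupby

# Full and abbreviated month names, lower-cased (assumes an English locale).
MONTHS = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}
# Translation table that turns every punctuation character into a space.
STRIP_PUNCT = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def date_tokens(text):
    """Return the month names and numbers in text, ordinal suffixes removed."""
    tokens = []
    for word in text.translate(STRIP_PUNCT).split():
        if word.lower() in MONTHS:
            tokens.append(word)
            continue
        number = re.match(r'(\d+)(\b|st|nd|rd|th)', word, re.I)
        if number:
            tokens.append(number.group(1))
    return tokens


def group_tokens(tokens):
    """Stable-sort so names come before numbers, then split the two groups."""
    ordered = sorted(tokens, key=str.isdigit)
    groups = {is_name: list(g) for is_name, g in groupby(ordered, key=lambda t: not t.isdigit())}
    return groups.get(True, []), groups.get(False, [])


tokens = date_tokens('27th Oct 2010 until 1st jan')
names, numbers = group_tokens(tokens)
print(tokens)   # ['27', 'Oct', '2010', '1', 'jan']
print(names)    # ['Oct', 'jan']
print(numbers)  # ['27', '2010', '1']
```

Because `sorted` is stable, the months and the numbers each keep their original left-to-right order. The later pairing of months with days relies on that order.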