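I have a dataset with a "date" column that contains dates in several formats, including:
There are also invalid dates, such as:
I am trying to find the entries that have an exact date (day, month and year) and convert them to datetime. I also need to exclude entries that have "Reported" in the field. Is there a way to filter this data without first finding every possible date format?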
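Answer 0: (score: 1)
Use the dateutil library.
Use an if statement to check whether any part of the date (day, month or year) is missing, and skip the entry if so. The check parses each string twice with different default values: dateutil fills any missing part from the default, so the two results agree only when day, month and year are all present.
If you want to extract a date from a longer string, for example "Reported 01 Jun 2018", use fuzzy=True. With fuzzy=False such strings fail to parse and are skipped.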
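import datetime
import dateutil.parser

dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formatted_dates = []
for date in dates:
    try:
        # Parse twice with different defaults; a missing day, month or year
        # is taken from the default, so the two results then differ.
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
    except (ValueError, OverflowError):
        # Not parseable as a date at all, so skip it.
        continue
    if first == second:
        formatted_dates.append(first)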
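To make the difference between fuzzy=False and fuzzy=True concrete, here is a minimal sketch that wraps the same two-default check in a helper. The name parse_exact_date and its fuzzy argument are made up for this illustration and are not part of dateutil; only dateutil.parser.parse is a real library call.

```python
import datetime
import dateutil.parser


def parse_exact_date(text, fuzzy=False):
    """Hypothetical helper: return a datetime only if text has day, month and year."""
    try:
        # Missing parts are filled from the default, so the results differ
        # whenever the string does not specify a full date.
        a = dateutil.parser.parse(text, fuzzy=fuzzy, default=datetime.datetime(2015, 1, 1))
        b = dateutil.parser.parse(text, fuzzy=fuzzy, default=datetime.datetime(2016, 2, 2))
    except (ValueError, OverflowError):
        return None
    return a if a == b else None


strict = parse_exact_date("Reported 01 Jun 2018")               # extra word rejected
lenient = parse_exact_date("Reported 01 Jun 2018", fuzzy=True)  # extra word ignored
partial = parse_exact_date("Jun 2018")                          # no day given
```

Since the question wants "Reported" entries excluded, the strict mode (fuzzy=False) is probably the one to use there; fuzzy=True is only for cases where the surrounding text should be ignored.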
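Another solution. This is the brute-force approach: it checks every date against every format. Keep adding formats to the list until it covers all the date formats in your data. Note that this approach is slow, since each string may be tried against the whole list.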
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
# %b / %B match abbreviated / full month names (%a / %A would match weekday names instead)
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%b%d","%Y.%b.%d","%Y-%b-%d","%Y%B%d","%Y.%B.%d","%Y-%B-%d",
           "%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
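formatted_dates = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
        except ValueError:
            # This format does not match; try the next one.
            continue
        formatted_dates.append(dt)
        break  # stop at the first matching format to avoid duplicates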
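If you need to know which input produced which datetime (for example to filter rows of the dataset), the same brute-force idea can be sketched as a function that returns None when nothing matches. The names FORMATS and parse_with_formats and the shortened format list are assumptions for this example. Because strptime has to match the whole string, entries such as "Reported 01 Jun 2018" or "Jun 2018" are rejected without any extra check.

```python
import datetime

# Hypothetical short format list for illustration; extend it for your data.
FORMATS = ["%Y.%m.%d", "%Y-%m-%d", "%d-%b-%Y", "%d %b %Y"]


def parse_with_formats(text, formats=FORMATS):
    """Return a datetime if text fully matches one of the formats, else None."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


dates = ["2018.05.07", "01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018", "2018"]
parsed = {d: parse_with_formats(d) for d in dates}
# Keep only the entries that parsed as a full date.
exact = {d: dt for d, dt in parsed.items() if dt is not None}
```

Month-name directives such as %b depend on the current locale, so this assumes an English (or C) locale.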
Answer 1: (score: 0)
In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
Hopefully this helps you find dates inside strings that contain them.