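I have some very messy dates that I am trying to get into a consistent format, %Y-%m-%d, where applicable. Some of the dates are missing the day, and some are in the future or simply impossible; those I would just flag as incorrect. How can I resolve these inconsistencies with Python?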
sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00
Answer 0 (score: 2)
If you like, you can use the dateutil parser:
from dateutil.parser import parse

bad_dates = [...]  # the sample strings from the question
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        print("couldn't parse", d, err)
Output (fields missing from the input are filled in from the current date):
1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month
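If you want to flag anything that is not a straightforward parse, you can check whether the string splits into three parts; if it does, try to parse it, otherwise flag it, like this: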
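flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        # Only attempt a parse when some delimiter yields exactly three parts.
        if not any(len(d.split(s)) == 3 for s in splitters):
            raise ValueError("expected three date components")
        good.append(parse(d))
    except Exception:
        # Covers strings with too few parts, parse failures, and non-strings.
        flagged.append(d)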
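The three-part check can also be tried on its own, without dateutil. The following is only a minimal sketch: the helper name has_three_parts and the delimiter set (hyphen, comma, slash, whitespace) are assumptions for illustration, not part of the answer above.

```python
import re


def has_three_parts(raw):
    """Return True when raw splits into exactly three non-empty fields."""
    # Split on runs of '-', ',', '/' or whitespace (an assumed delimiter set).
    parts = [p for p in re.split(r"[-,/\s]+", raw.strip()) if p]
    return len(parts) == 3


samples = ["4-Jul-97", "Jun-13", "Sep, 2020", "nan", "90/1/90"]
# Strings without day, month and year all present are set aside for review.
needs_review = [s for s in samples if not has_three_parts(s)]
print(needs_review)
```

Note that "90/1/90" passes this check because it does have three fields; it still has to be rejected by the parser itself.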
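Answer 1 (score: 2)
Some of the values are ambiguous, and you can get different results depending on your priorities. For example, if you want all dates to be handled consistently, you can specify a list of formats to try: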
#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line))  # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else:  # no break
        sys.stderr.write("failed to parse " + line)
Example:
$ python . <input.txt
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
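You could use other criteria instead; for example, you could maximize the number of dates that parse successfully, even if some dates are then handled inconsistently (the dateutil and pandas solutions may fall into this category).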
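The question also asks for %Y-%m-%d output and for future dates to be flagged. Here is a rough sketch of how the format-list approach might be extended to do that. The names FORMATS and normalize, the (value, reason) return shape, and the reference date passed as today are all assumptions for illustration. The "%d %b" format is left out on purpose, so that a string with no year is flagged rather than silently dated 1900.

```python
import re
from datetime import date, datetime

# Candidate formats, tried in order (month names assume an English locale).
FORMATS = ["%d %b %y", "%m %d %y", "%b %y", "%b %Y"]


def normalize(raw, today):
    """Return (iso_string, None) on success or (None, reason) when flagged."""
    cleaned = " ".join(re.findall(r"\w+", raw))  # normalize delimiters
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format does not fit; try the next one
        if parsed > today:
            return None, "future date"
        return parsed.strftime("%Y-%m-%d"), None
    return None, "unparseable"


reference = date(2015, 6, 1)  # hypothetical "today" so the result is repeatable
for raw in ["4-Jul-97", "Sep, 2020", "90/1/90", "Dec-00"]:
    print(raw, normalize(raw, reference))
```

Passing the reference date in explicitly, rather than calling date.today() inside the function, keeps the future-date check reproducible.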
Answer 2 (score: 1)
pd.datetools.to_datetime will guess the format for you. Most of your dates seem to come out fine, though you may want to add some extra rules:
df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]:
0 1997-07-04 00:00:00
1 2002-08-31 00:00:00
2 1995-05-20 00:00:00
3 1992-05-12 00:00:00
4 2015-06-13 00:00:00
5 1998-08-04 00:00:00
6 90/1/90
7 1977-03-10 00:00:00
8 2015-12-07 00:00:00
9 NaN
10 1998-04-03 00:00:00
11 1976-08-01 00:00:00
12 1990-03-01 00:00:00
13 2015-09-01 00:00:00
14 1974-04-01 00:00:00
15 2003-10-10 00:00:00
16 Dec-00
Name: sample, dtype: object