Formatting really inconsistent dates with Python

Time: 2015-07-01 04:12:19

Tags: python datetime pandas

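I have some very messy dates that I'm trying to get into a consistent %Y-%m-%d format where applicable. Some of the dates lack the day, and some are in the future or simply impossible; those I just want to flag as incorrect. How can I handle these inconsistencies in Python?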

sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00

3 answers:

Answer 0: (score: 2)

You can use the dateutil parser if you like

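from dateutil.parser import parse

bad_dates = [...]  # elided in the original; the sample dates from the question
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        # dateutil raises on impossible or unrecognised input
        print("couldn't parse", d, err)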

Output

1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month
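
The day 30 in results such as 1976-08-30, and the year 2015 in 2015-12-07, do not come from the input. dateutil takes any field missing from the string from a default datetime, which is the current date when none is given (the output above looks like it was produced on the 30th of a month in 2015, but that is only an inference from the output). A minimal sketch that makes the fill-in deterministic by passing `default` explicitly; the 1900-01-01 sentinel and the helper name `parse_with_fixed_default` are assumptions chosen for illustration:

```python
from datetime import datetime
from dateutil.parser import parse

# Assumption: 1900-01-01 acts as a sentinel for "field not given in the input".
DEFAULT = datetime(1900, 1, 1)

def parse_with_fixed_default(text):
    """Parse text, taking any missing field from DEFAULT instead of today's date."""
    return parse(text, default=DEFAULT)
```

With this, a result whose year is 1900 or whose day is 1 is a hint (not proof) that the field was absent from the input, so it can be flagged for review.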

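If you want to flag anything that isn't a straightforward parse, you can check whether the string has 3 parts: if it does, try to parse it; otherwise flag it, like this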

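flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        a = None
        for s in splitters:
            # only attempt a parse when the string has three parts
            if len(d.split(s)) == 3:
                a = parse(d)
                good.append(a)
                break  # avoid appending the same date twice
        if a is None:
            raise ValueError('not a three-part date')
    except Exception:
        # covers parse errors as well as non-string values such as NaN
        flagged.append(d)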
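
The three-part check on its own can also be written as one small helper, which makes it easier to test separately from the parsing. This is a sketch using only the standard library; `has_three_parts` is a hypothetical name, and unlike the loop above it treats the three separators interchangeably, which is an assumption about the data:

```python
import re

def has_three_parts(value):
    """True if value is a string that splits into exactly three non-empty parts."""
    if not isinstance(value, str):  # e.g. a float NaN coming from pandas
        return False
    parts = [p.strip() for p in re.split(r"[-,/]", value)]
    return len(parts) == 3 and all(parts)
```

A value that passes this check can still be impossible (90/1/90 has three parts), so it still needs the try/except around the actual parse.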

Answer 1: (score: 2)

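Some of the values are ambiguous. You can get different results depending on your priorities. For example, if you want all dates to be handled consistently, you can specify a list of formats to try: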

#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line)) # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else: # no break
        sys.stderr.write("failed to parse " + line)

Example:

$ python . <input.txt 
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
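
In this output, 7-Dec becomes 1900-12-07 and Jun-13 becomes 2013-06-01 because strptime fills a missing year with 1900 and a missing day with 1. To flag such partial dates instead of silently accepting the fill-ins, the matched format can be returned together with the date. A sketch reusing the same format list and order; `parse_with_format` and `describe` are hypothetical helper names, and the month abbreviations assume an English (or C) locale:

```python
import re
from datetime import datetime

# Same formats, in the same priority order, as the script above.
FORMATS = ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]

def parse_with_format(text, formats=FORMATS):
    """Return (date, matched_format), or (None, None) if no format matches."""
    normalized = " ".join(re.findall(r"\w+", text))  # normalize delimiters
    for fmt in formats:
        try:
            return datetime.strptime(normalized, fmt).date(), fmt
        except ValueError:
            continue
    return None, None

def describe(text):
    """Return the date as YYYY-MM-DD, noting which fields strptime filled in."""
    d, fmt = parse_with_format(text)
    if d is None:
        return "unparseable"
    notes = []
    if "%d" not in fmt:
        notes.append("day missing")
    if "%y" not in fmt.lower():  # matches neither %y nor %Y
        notes.append("year missing")
    suffix = " (" + ", ".join(notes) + ")" if notes else ""
    return d.isoformat() + suffix
```

For example, `describe("Jun-13")` gives `2013-06-01 (day missing)` and `describe("90/1/90")` gives `unparseable`.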

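You could use other criteria. For example, you could maximize the number of successfully parsed dates even if some dates end up being handled inconsistently (the dateutil and pandas solutions may fall into this category).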

Answer 2: (score: 1)

pd.datetools.to_datetime will guess for you; most of your dates seem to come out fine, though you might want to add a few extra rules?

df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]: 
0     1997-07-04 00:00:00
1     2002-08-31 00:00:00
2     1995-05-20 00:00:00
3     1992-05-12 00:00:00
4     2015-06-13 00:00:00
5     1998-08-04 00:00:00
6                 90/1/90
7     1977-03-10 00:00:00
8     2015-12-07 00:00:00
9                     NaN
10    1998-04-03 00:00:00
11    1976-08-01 00:00:00
12    1990-03-01 00:00:00
13    2015-09-01 00:00:00
14    1974-04-01 00:00:00
15    2003-10-10 00:00:00
16                 Dec-00
Name: sample, dtype: object
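
One such extra rule, flagging dates that lie in the future as the question asks, might look like the sketch below. It uses only the standard library and works on already-parsed `datetime.date` values; `flag_future` and its `today` parameter are hypothetical names, not part of pandas:

```python
from datetime import date

def flag_future(d, today=None):
    """Return (d, flag), where flag is 'ok', 'future' or 'unparsed'."""
    if today is None:
        today = date.today()
    if d is None:  # the value could not be parsed at all
        return d, "unparsed"
    return d, ("future" if d > today else "ok")
```

Passing `today` explicitly keeps the check reproducible; for instance, `flag_future(date(2020, 9, 1), today=date(2015, 7, 1))` marks the Sep, 2020 entry as `future` relative to the date of the question.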