下面是我的df样本
date value
0006-03-01 00:00:00 1
0006-03-15 00:00:00 2
0006-05-15 00:00:00 1
0006-07-01 00:00:00 3
0006-11-01 00:00:00 1
2009-05-20 00:00:00 2
2009-05-25 00:00:00 8
2020-06-24 00:00:00 1
2020-06-30 00:00:00 2
2020-07-01 00:00:00 13
2020-07-15 00:00:00 2
2020-08-01 00:00:00 4
2020-10-01 00:00:00 2
2020-11-01 00:00:00 4
2023-04-01 00:00:00 1
2218-11-12 10:00:27 1
4000-01-01 00:00:00 6
5492-04-15 00:00:00 1
5496-03-15 00:00:00 1
5589-12-01 00:00:00 1
7199-05-15 00:00:00 1
9186-12-30 00:00:00 1
如您所见,数据包含一些拼写错误的日期。
问题:
最终输出应如下所示。
date value
20.05.2009 2
25.05.2009 8
26.04.2020 1
30.06.2020 2
01.07.2020 13
15.07.2020 2
01.08.2020 4
01.10.2020 2
01.11.2020 4
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
我尝试使用to_datetime转换列,但失败。
df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
Out of bounds nanosecond timestamp: 5-03-01 00:00:00
感谢任何帮助!
答案 0 :(得分:1)
您可以在'-'分割后检查日期时间字符串的第一个元素,并根据其整数值进行清理/替换。对于诸如“ 0006”之类的较小值,可以使用pd.to_datetime
来调用errors='coerce'
。它将为无效日期保留“ NaT”。您可以使用dropna()
删除它们。示例:
import pandas as pd
df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
'0006-03-15 00:00:00',
'0006-05-15 00:00:00',
'0006-07-01 00:00:00',
'0006-11-01 00:00:00',
'nan',
'2009-05-25 00:00:00',
'2020-06-24 00:00:00',
'2020-06-30 00:00:00',
'2020-07-01 00:00:00',
'2020-07-15 00:00:00',
'2020-08-01 00:00:00',
'2020-10-01 00:00:00',
'2020-11-01 00:00:00',
'2023-04-01 00:00:00',
'2218-11-12 10:00:27',
'4000-01-01 00:00:00',
'NaN',
'5496-03-15 00:00:00',
'5589-12-01 00:00:00',
'7199-05-15 00:00:00',
'9186-12-30 00:00:00']})
# first, drop columns where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]
# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'
# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 NaT
# 5 2009-05-20
# 6 2009-05-25
# ...
df = df.dropna()
# df
# date
# 6 2009-05-25
# 7 2020-06-24
# 8 2020-06-30
# 9 2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
答案 1 :(得分:0)
由于熊猫的局限性,引发了超出范围的错误(https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)。此代码将在创建数据框之前删除可能导致此错误的值。
import datetime as dt
import pandas as pd
data = [[dt.datetime(year=2022, month=3, day=1), 1],
[dt.datetime(year=2009, month=5, day=20), 2],
[dt.datetime(year=2001, month=5, day=20), 2],
[dt.datetime(year=2023, month=12, day=30), 3],
[dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]
df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
2 2001-05-20 2
3 2023-12-30 3
df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
#OUTPUT
0 2022-03-01 1
1 2009-05-20 2
3 2100-01-01 3
如果您仍要包括引发超出范围错误的日期,请查看How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
答案 2 :(得分:0)
我建议以下内容:
df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
'7199-05-15 00:00:00',
'2020-10-21 00:00:00'],
'value': [1, 2, 3]})
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023' \
else '01.01.2100' if d[:4] > '2022' else np.NaN for d in df['date']]
df.dropna(inplace = True)
这将产生所需的输出:
date value
01.01.2100 2
21.10.2020 3