Pandas: fill missing date values with a fixed date

Time: 2021-07-15 11:56:56

Tags: pandas date missing-data imputation

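I am trying to impute the missing values in a date column with a fixed date. I am using the code below, but the missing values stay as they are and are not replaced by the date I supply.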

df:

termination_date
2020-06-28 00:00:00

2020-07-13 00:00:00
2020-08-11 00:00:00

2020-08-11 00:00:00

To replace the missing values I want to use the date "2020-07-31 00:00:00", with the following code:

df['termination_date'] = df['termination_date'].fillna(value=pd.to_datetime('2020-07-31 00:00:00'))

The output should look like this:

termination_date
2020-06-28 00:00:00
2020-07-31 00:00:00
2020-07-13 00:00:00
2020-08-11 00:00:00
2020-07-31 00:00:00
2020-08-11 00:00:00

2 answers:

Answer 0: (score: 1)

Convert the values that are not datetimes to NaT first, so that fillna can replace them:

df['termination_date'] = (pd.to_datetime(df['termination_date'], errors='coerce')
                            .fillna(pd.to_datetime('2020-07-31')))

# the time part is not displayed because every value has the same time, 00:00:00
print (df)
  termination_date
0       2020-06-28
1       2020-07-31
2       2020-07-13
3       2020-08-11
4       2020-07-31
5       2020-08-11

print(df['termination_date'].tolist())
[Timestamp('2020-06-28 00:00:00'), Timestamp('2020-07-31 00:00:00'),
 Timestamp('2020-07-13 00:00:00'), Timestamp('2020-08-11 00:00:00'), 
 Timestamp('2020-07-31 00:00:00'), Timestamp('2020-08-11 00:00:00')]

print (df.termination_date.dtypes)
datetime64[ns]
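
To see why the original fillna call had no effect, here is a minimal sketch. It assumes the missing entries are stored as empty strings rather than NaN/NaT, which the question does not confirm; the Series `s` and the other variable names are made up for illustration.

```python
import pandas as pd

# Assumption: the "missing" entries are empty strings, not NaN/NaT.
s = pd.Series(["2020-06-28 00:00:00", "", "2020-07-13 00:00:00"])

# fillna only targets null values (NaN/None/NaT); an empty string is
# an ordinary value, so there is nothing for fillna to replace.
print(s.isna().tolist())        # → [False, False, False]

# After conversion with errors='coerce', the empty string becomes NaT.
coerced = pd.to_datetime(s, errors="coerce")
print(coerced.isna().tolist())  # → [False, True, False]

# Now fillna has something to replace.
filled = coerced.fillna(pd.to_datetime("2020-07-31"))
print(filled.tolist())
```

If `isna()` already reports True for the missing rows in your data, the cause is something else and this sketch does not apply.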

Answer 1: (score: 1)

Starting from your DataFrame:

>>> df = pd.DataFrame({'termination_date': ["2020-06-28 00:00:00",
...                                         "",
...                                         "2020-07-13 00:00:00",
...                                         "2020-08-11 00:00:00",
...                                         "",
...                                         "2020-08-11 00:00:00"]}, 
...                   index = [0, 1, 2, 3, 4, 5])
>>> df
    termination_date
0   2020-06-28 00:00:00
1   
2   2020-07-13 00:00:00
3   2020-08-11 00:00:00
4   
5   2020-08-11 00:00:00

We can use loc to replace the missing values with pd.to_datetime('2020-07-31 00:00:00') and get the expected result:

>>> df.loc[df['termination_date'] == '', 'termination_date'] = pd.to_datetime('2020-07-31 00:00:00')
>>> df
    termination_date
0   2020-06-28 00:00:00
1   2020-07-31 00:00:00
2   2020-07-13 00:00:00
3   2020-08-11 00:00:00
4   2020-07-31 00:00:00
5   2020-08-11 00:00:00

Finally, we can convert the column to datetime to make sure no string values remain:

df['termination_date'] = pd.to_datetime(df['termination_date'])
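
A possible variant of the steps above does the conversion first and the loc assignment second, so the column never holds strings and Timestamp objects at the same time. This is only a sketch under the same assumption that missing values are empty strings; `default_date` and `missing` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"termination_date": ["2020-06-28 00:00:00",
                                        "",
                                        "2020-07-13 00:00:00",
                                        "2020-08-11 00:00:00",
                                        "",
                                        "2020-08-11 00:00:00"]})

default_date = pd.Timestamp("2020-07-31")  # hypothetical fill value

# Convert first: empty strings become NaT and the column becomes datetime.
df["termination_date"] = pd.to_datetime(df["termination_date"], errors="coerce")

# Boolean mask of the rows that are still missing after conversion.
missing = df["termination_date"].isna()

# Assign the fixed date only to those rows.
df.loc[missing, "termination_date"] = default_date

print(df)
print(df["termination_date"].dtype)
```

Note that errors='coerce' also turns any other unparseable value into NaT, so those rows would receive the fixed date as well.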