I have a pandas DataFrame that looks like this:
import numpy as np
import pandas as pd

df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "rand": [np.nan, 3, 7, 8, np.nan, 4],
                         "val3": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})
df_first
id val1 val2 rand val3 unique_date end_date
0 102 NaN 5.0 NaN 5.0 2002-03-03 2005-03-03
1 102 4.0 NaN 3.0 NaN 2002-03-05 2003-04-07
2 102 NaN NaN 7.0 NaN 2003-04-05 NaT
3 102 NaN NaN 8.0 NaN 2003-04-09 NaT
4 103 1.0 NaN NaN 3.0 2003-08-07 2003-10-07
5 103 NaN 5.0 4.0 NaN 2003-09-07 NaT
The imputation of the missing values should be done by forward-filling the values that appear in each row that has an end_date value. The forward fill should only reach rows whose unique_date lies before that row's end_date. On top of that, the forward filling should be done per id. Finally, the imputation should only be applied to the columns whose names contain val; an important note is that no other column names match that pattern. In case I have not been clear enough, the solution for the DataFrame above is posted below:
id val1 val2 rand val3 unique_date
0 102 NaN 5.0 NaN 5.0 2002-03-03
1 102 4.0 5.0 3.0 5.0 2002-03-05
2 102 4.0 5.0 7.0 5.0 2003-04-05
3 102 NaN 5.0 8.0 5.0 2003-04-09
4 103 1.0 NaN NaN 3.0 2003-08-07
5 103 1.0 5.0 4.0 3.0 2003-09-07
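The rules above can be sketched directly in plain pandas. This is only a minimal illustration of the stated rule, not necessarily the intended implementation; it assumes the rows are already sorted by id and unique_date, as in the frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [102, 102, 102, 102, 103, 103],
    "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
    "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
    "rand": [np.nan, 3, 7, 8, np.nan, 4],
    "val3": [5, np.nan, np.nan, np.nan, 3, np.nan],
    "unique_date": pd.to_datetime(["2002-03-03", "2002-03-05", "2003-04-05",
                                   "2003-04-09", "2003-08-07", "2003-09-07"]),
    "end_date": pd.to_datetime(["2005-03-03", "2003-04-07", None,
                                None, "2003-10-07", None]),
})

# only columns whose name contains "val" take part in the imputation
val_cols = df.filter(like="val").columns

for i in df.index:
    end = df.at[i, "end_date"]
    if pd.isna(end):
        continue
    # the non-missing val* values of row i get pushed forward
    row = df.loc[i, val_cols]
    cols = val_cols[row.notna().to_numpy()]
    for j in df.index[df.index >= i]:
        # stop at a new id or once unique_date reaches row i's end_date
        if df.at[j, "id"] != df.at[i, "id"] or df.at[j, "unique_date"] >= end:
            break
        df.loc[j, cols] = row[cols].to_numpy()

df = df.drop(columns="end_date")
```

Running this reproduces the target table shown above, including row 3, where val1 stays NaN because row 1's end_date (2003-04-07) is already past.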
Let me know if you need any further clarification, since at first glance the whole process might seem quite complicated.
Looking forward to your answers!
Answer 0 (score: 0)
Apologies for the confusing question and explanation. In the end, I was able to achieve what I wanted as follows.
import numpy as np
import pandas as pd

df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
                         "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
                         "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
                         "val3": [np.nan, 3, np.nan, np.nan, np.nan, 4],
                         "val4": [5, np.nan, np.nan, np.nan, 3, np.nan],
                         "rand": [3, np.nan, 1, np.nan, 5, 6],
                         "unique_date": [pd.Timestamp(2002, 3, 3),
                                         pd.Timestamp(2002, 3, 5),
                                         pd.Timestamp(2003, 4, 5),
                                         pd.Timestamp(2003, 4, 9),
                                         pd.Timestamp(2003, 8, 7),
                                         pd.Timestamp(2003, 9, 7)],
                         "end_date": [pd.Timestamp(2005, 3, 3),
                                      pd.Timestamp(2003, 4, 7),
                                      np.nan,
                                      np.nan,
                                      pd.Timestamp(2003, 10, 7),
                                      np.nan]})
display(df_first)
# positions of the columns that take part in the imputation
indexes = []
columns = df_first.filter(like="val").columns
for column in columns:
    indexes.append(df_first.columns.get_loc(column))

elements = df_first.values[:, indexes]
ids = df_first.values[:, df_first.columns.get_loc("id")]
start_dates = df_first.values[:, df_first.columns.get_loc("unique_date")]
end_dates = df_first.values[:, df_first.columns.get_loc("end_date")]

for i in range(len(elements)):
    if pd.notnull(end_dates[i]):
        # the non-missing val* values of row i get propagated forward
        not_nan_indexes = np.argwhere(~pd.isnull(elements[i])).ravel()
        elements_prop = elements[i, not_nan_indexes]
        j = i
        # push them onto later rows of the same id whose unique_date
        # still falls before row i's end_date
        while (j < len(elements)
               and start_dates[j] < end_dates[i]
               and ids[i] == ids[j]):
            elements[j, not_nan_indexes] = elements_prop
            j += 1

df_first[columns] = elements
df_first = df_first.drop(columns="end_date")
display(df_first)
The solution is probably overkill, but I could not find anything built into pandas that achieves what I was after.
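For what it's worth, the same propagation can also be written per id with groupby, staying in pandas instead of dropping to raw .values arrays. propagate_vals below is a hypothetical helper sketching that idea; it assumes rows are sorted by unique_date within each id, and it overwrites later values just like the loop solution:

```python
import numpy as np
import pandas as pd

def propagate_vals(df):
    """For every row that has an end_date, push its non-missing val*
    values onto the following rows of the same id whose unique_date
    still falls before that end_date."""
    out = df.copy()
    val_cols = out.filter(like="val").columns
    for _, grp in out.groupby("id", sort=False):
        for i in grp.index:
            end = out.at[i, "end_date"]
            if pd.isna(end):
                continue
            row = out.loc[i, val_cols]
            cols = val_cols[row.notna().to_numpy()]
            filler = row[cols].to_numpy()
            for j in grp.index[grp.index >= i]:
                if out.at[j, "unique_date"] >= end:
                    break
                out.loc[j, cols] = filler
    return out.drop(columns="end_date")

df_first = pd.DataFrame({
    "id": [102, 102, 102, 102, 103, 103],
    "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
    "val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
    "val3": [np.nan, 3, np.nan, np.nan, np.nan, 4],
    "val4": [5, np.nan, np.nan, np.nan, 3, np.nan],
    "rand": [3, np.nan, 1, np.nan, 5, 6],
    "unique_date": pd.to_datetime(["2002-03-03", "2002-03-05", "2003-04-05",
                                   "2003-04-09", "2003-08-07", "2003-09-07"]),
    "end_date": pd.to_datetime(["2005-03-03", "2003-04-07", None,
                                None, "2003-10-07", None]),
})
result = propagate_vals(df_first)
```

On this frame the helper gives the same result as the array-based loop, and keeping everything label-based makes the id grouping and the "val" column selection explicit.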