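I have a dataframe with many null values. I want to fill each null with data from the same column for the same user at a later date. Here is the dataframe: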
import pandas as pd
import numpy as np
array = {
    'user': ['Trevor', 'John', 'Trevor', 'John', 'Trevor', 'Trevor', 'John'],
    'date': ['2020-10-11 08:00:00', '2020-10-15 08:00:00', '2020-10-17 08:00:00',
             '2020-10-19 08:00:00', '2020-10-10 08:00:00', '2020-11-11 12:34:00',
             '2020-11-16 09:12:00'],
    'test1': [5, np.nan, np.nan, np.nan, np.nan, 8, 4],
    'test2': [np.nan, 8, 3, np.nan, 1, 8, 6],
    'test3': [np.nan, np.nan, 3, 5, np.nan, 8, np.nan],
}
df = pd.DataFrame(array)
df.sort_values(by=['user', 'date'], ascending = True)
     user                 date  test1  test2  test3
1    John  2020-10-15 08:00:00    NaN    8.0    NaN
3    John  2020-10-19 08:00:00    NaN    NaN    5.0
6    John  2020-11-16 09:12:00    4.0    6.0    NaN
4  Trevor  2020-10-10 08:00:00    NaN    1.0    NaN
0  Trevor  2020-10-11 08:00:00    5.0    NaN    NaN
2  Trevor  2020-10-17 08:00:00    NaN    3.0    3.0
5  Trevor  2020-11-11 12:34:00    8.0    8.0    8.0
This is the desired output:
     user  test1  test2  test3
0    John      4      8      5
1  Trevor      5      1      3
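Answer 0 (score: 3)
I don't fully understand how "fill each null with data from the same column for the same user at a later date" relates to the desired output you posted, but you can use pivot_table to get what you want: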
# Sort in place (inplace=True added) so each user's rows are in date order
df.sort_values(by=['user', 'date'], ascending=True, inplace=True)
# Pivot on 'user'; 'first' takes the first non-null value in each column
df.pivot_table(index='user', aggfunc='first').drop('date', axis=1)
        test1  test2  test3
user
John      4.0    8.0    5.0
Trevor    5.0    1.0    3.0
If I've misunderstood you, please correct me.
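One possible reading that connects the question's wording to the desired output: back-fill each null from the same user's later rows, then keep only each user's earliest row. The sketch below follows that reading. It is an assumption about the intent, not something the question confirms, and the names `data`, `tests`, `filled`, and `result` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

data = {
    'user': ['Trevor', 'John', 'Trevor', 'John', 'Trevor', 'Trevor', 'John'],
    'date': ['2020-10-11 08:00:00', '2020-10-15 08:00:00', '2020-10-17 08:00:00',
             '2020-10-19 08:00:00', '2020-10-10 08:00:00', '2020-11-11 12:34:00',
             '2020-11-16 09:12:00'],
    'test1': [5, np.nan, np.nan, np.nan, np.nan, 8, 4],
    'test2': [np.nan, 8, 3, np.nan, 1, 8, 6],
    'test3': [np.nan, np.nan, 3, 5, np.nan, 8, np.nan],
}
df = pd.DataFrame(data)

# Parse the dates so sorting is chronological rather than lexicographic.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['user', 'date'])

tests = ['test1', 'test2', 'test3']

# Step 1: within each user, fill every NaN from the next later non-null row.
filled = df.copy()
filled[tests] = df.groupby('user')[tests].bfill()

# Step 2: keep each user's earliest row, which now holds the filled values.
result = filled.drop_duplicates('user')[['user'] + tests].reset_index(drop=True)
print(result)  # John: 4, 8, 5 and Trevor: 5, 1, 3 (shown as floats)
```

Under this reading the result matches the pivot_table output, because 'first' skips nulls within each group. `df.groupby('user')[tests].first()` on the sorted frame should give the same values in one step.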