I have a long-format DataFrame df that looks like this:
user_id day action1 action2 action3 action4 action5
1 0 4 2 0 1 0
1 1 4 2 0 1 0
2 1 4 2 0 1 0
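The values in the action columns are the number of times the user performed that action on that day. I want to convert this into a wide DataFrame,
but be able to extend the time range arbitrarily (say, out to 365 days).
I can reshape it to wide form easily enough: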
df_indexed = df.set_index(['user_id', 'day'])
df_wide = df_indexed.unstack().fillna(0)
How can I add the remaining 358 days, filled with 0, for each of the five actions?
Answer 0: (score: 1)
This is similar to @ViktorKerkez's answer, but uses pandas.merge:
In [83]: df
Out[83]:
user_id day action1 action2 action3 action4 action5
0 1 0 4 2 0 1 0
1 1 1 4 2 0 1 0
2 2 1 4 2 0 1 0
In [84]: days_joiner = DataFrame(dict(zip(['user_id', 'day'], zip(*list(itertools.product(df.user_id.unique(), range(365)))))))
In [85]: result = pd.merge(df, days_joiner, how='outer')
In [86]: result.head(10)
Out[86]:
user_id day action1 action2 action3 action4 action5
0 1 0 4 2 0 1 0
1 1 1 4 2 0 1 0
2 2 1 4 2 0 1 0
3 1 2 NaN NaN NaN NaN NaN
4 1 3 NaN NaN NaN NaN NaN
5 1 4 NaN NaN NaN NaN NaN
6 1 5 NaN NaN NaN NaN NaN
7 1 6 NaN NaN NaN NaN NaN
8 1 7 NaN NaN NaN NaN NaN
9 1 8 NaN NaN NaN NaN NaN
In [87]: result.fillna(0).head(10)
Out[87]:
user_id day action1 action2 action3 action4 action5
0 1 0 4 2 0 1 0
1 1 1 4 2 0 1 0
2 2 1 4 2 0 1 0
3 1 2 0 0 0 0 0
4 1 3 0 0 0 0 0
5 1 4 0 0 0 0 0
6 1 5 0 0 0 0 0
7 1 6 0 0 0 0 0
8 1 7 0 0 0 0 0
9 1 8 0 0 0 0 0
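For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same merge idea. The names grid, merged and action_cols are only illustrative and are not from the session above, and it left-merges from the full user/day grid (rather than the outer merge shown above) so that the rows come out ordered by user and day.

```python
import itertools

import pandas as pd

# Sample data matching the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [0, 1, 1],
    "action1": [4, 4, 4],
    "action2": [2, 2, 2],
    "action3": [0, 0, 0],
    "action4": [1, 1, 1],
    "action5": [0, 0, 0],
})

# Every (user_id, day) combination for a 365-day range.
grid = pd.DataFrame(
    list(itertools.product(df["user_id"].unique().tolist(), range(365))),
    columns=["user_id", "day"],
)

# Left-merge onto the grid: combinations missing from df get NaN in the action columns.
merged = grid.merge(df, on=["user_id", "day"], how="left")

# Fill the gaps with 0 and restore integer dtype (NaN forces float during the merge).
action_cols = [c for c in df.columns if c.startswith("action")]
merged[action_cols] = merged[action_cols].fillna(0).astype(int)
```

Filling only the action columns leaves user_id and day untouched, which is a bit safer than calling fillna(0) on the whole frame.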
To be fair, here is a %timeit comparison of the two approaches:
In [90]: timeit pd.merge(df, days_joiner, how='outer')
1000 loops, best of 3: 1.33 ms per loop
In [96]: timeit df_indexed.reindex(index, fill_value=0)
10000 loops, best of 3: 146 µs per loop
My answer is about 9 times slower!
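Answer 1: (score: 0)
You can work with a MultiIndexed DataFrame: use itertools.product
to build a new index that combines every user in the DataFrame with every day you want, then simply reindex onto it, filling the missing values with 0.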
import itertools
import pandas as pd
users = df.user_id.unique()
df_indexed = df.set_index(['user_id', 'day'])
index = pd.MultiIndex.from_tuples(list(itertools.product(users, range(365))))
reindexed = df_indexed.reindex(index, fill_value=0)
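To get all the way to the wide layout the question asks for, the reindexed frame can then be unstacked. Below is a minimal end-to-end sketch under a couple of assumptions: it uses pd.MultiIndex.from_product in place of itertools.product plus from_tuples (it should build the same index and lets the levels be named), and n_days and full_index are illustrative names only.

```python
import pandas as pd

# Sample data matching the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": [0, 1, 1],
    "action1": [4, 4, 4],
    "action2": [2, 2, 2],
    "action3": [0, 0, 0],
    "action4": [1, 1, 1],
    "action5": [0, 0, 0],
})

n_days = 365  # assumed length of the time range

df_indexed = df.set_index(["user_id", "day"])

# Full (user_id, day) index; naming the levels lets unstack refer to "day" by name.
full_index = pd.MultiIndex.from_product(
    [df["user_id"].unique(), range(n_days)], names=["user_id", "day"]
)

# Missing (user_id, day) rows are created and filled with 0.
reindexed = df_indexed.reindex(full_index, fill_value=0)

# One row per user, one column per (action, day) pair.
df_wide = reindexed.unstack("day")
```

Because the gaps are filled during reindex, no NaN values are introduced and the counts stay integers; df_wide ends up with 5 * n_days columns.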