I have a huge Pandas DataFrame df with over 4 million rows, which looks like this:
   id     value   percent   value_1  percent_1
0   1  0.530106   21%-31%       NaN        NaN
1   2  0.086647   10%-20%       NaN        NaN
2   3  0.073121  $30%-40%       NaN        NaN
3   4   0.76891   81%-90%       NaN        NaN
4   5   0.86536   41%-50%       NaN        NaN
5   1       NaN       NaN  0.630106   91%-100%
6   2       NaN       NaN  0.086647    11%-20%
7   3       NaN       NaN  0.073121    $0%-10%
8   4       NaN       NaN  0.376891    81%-90%
9   5       NaN       NaN  0.186536    41%-50%
I want a DataFrame that looks like this:
   id     value   percent   value_1  percent_1
0   1  0.530106   21%-31%  0.630106   91%-100%
1   2  0.086647   10%-20%  0.086647    11%-20%
2   3  0.073121  $30%-40%  0.073121    $0%-10%
3   4   0.76891   81%-90%  0.376891    81%-90%
4   5   0.86536   41%-50%  0.186536    41%-50%
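For reference, here is a minimal sketch (values copied from the tables above, so it is only a toy reproduction) that rebuilds the small sample frame for testing; the real frame has over 4 million rows:

import numpy as np
import pandas as pd

# Toy reproduction of the sample data shown above
df = pd.DataFrame({
    'id':        [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'value':     [0.530106, 0.086647, 0.073121, 0.76891, 0.86536] + [np.nan] * 5,
    'percent':   ['21%-31%', '10%-20%', '$30%-40%', '81%-90%', '41%-50%'] + [np.nan] * 5,
    'value_1':   [np.nan] * 5 + [0.630106, 0.086647, 0.073121, 0.376891, 0.186536],
    'percent_1': [np.nan] * 5 + ['91%-100%', '11%-20%', '$0%-10%', '81%-90%', '41%-50%'],
})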
One approach is to replace the NaN values with empty strings, cast the whole DataFrame to strings, and then group by id:
import numpy as np

# NaN -> '', cast to str, then concatenate the strings within each id group
df = df.replace(np.nan, '')
df = df.astype(str)
df = df.groupby(['id']).sum()
But this takes a very long time, because groupby on string columns is slow. Is there a better way?
Answer 0 (score: 3)
Let's try using groupby with first, which skips NaN values:
# first() returns the first non-null value per column within each id group
df = df.groupby('id').first().reset_index()
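As a quick check on the toy frame sketched in the question (names assumed from there), this should reproduce the desired output, because first() picks the first non-null entry per column within each id group:

# Collapse the two partial rows per id into one complete row
out = df.groupby('id').first().reset_index()
print(out)

Since no string conversion is involved, this should also be much faster than the string-concatenation workaround on the full 4-million-row frame.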