Question

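I'm trying to speed up one of my functions. I've read that "vectorization" is the fastest way to run this kind of operation in pandas, but how can I achieve that (or anything else that is faster) with this code: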

Dummy data

import numpy as np
import pandas as pd

a = pd.DataFrame({'var1' : [33, 75, 464, 88, 34], 'imp_flag' : [1, 0, 0, 1, 1], 'donor_index' : [3, np.nan, np.nan, 4, 0]})

>>> a
   var1  imp_flag  donor_index
0    33         1          3.0
1    75         0          NaN
2   464         0          NaN
3    88         1          4.0
4    34         1          0.0

Expected output

>>> b
   var1  imp_flag  donor_index
1    75         0          NaN
1    75         0          NaN
2   464         0          NaN

Answer 1

There are a few things I don't understand.

If a is created without any NaN values, how can your DataFrame return NaN values in donor_index?

a = pd.DataFrame({'var1' : [33, 75, 464, 88, 34], 'imp_flag' : [1, 0, 0, 1, 1], 'donor_index' : [3, 1, 2, 4, 0]})

>>> a
   var1  imp_flag  donor_index
0    33         1            3
1    75         0            1
2   464         0            2
3    88         1            4
4    34         1            0

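Are you sure that selecting a[a['imp_flag'] == 1] is correct in your example? The result you show for b looks like the opposite selection, a[a['imp_flag'] == 0].

Also, do you really need the duplicated rows in DataFrame b?

My solution is as follows: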

idxs = a[a.imp_flag == 0].donor_index
b = a.iloc[idxs]
# or, in one line: b = a.iloc[a[a.imp_flag == 0].donor_index]

>>> b
   var1  imp_flag  donor_index
1    75         0            1
2   464         0            2
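For reference, here is a self-contained sketch of the same idea that can be run as-is. It assumes the integer-valued frame from this answer (no NaN in donor_index); the `.to_numpy()` call is my addition to make it explicit that `iloc` receives plain integer positions.

```python
import pandas as pd

# Frame from this answer: donor_index holds integers only (an assumption
# that differs from the question's data, which contains NaN).
a = pd.DataFrame({
    'var1': [33, 75, 464, 88, 34],
    'imp_flag': [1, 0, 0, 1, 1],
    'donor_index': [3, 1, 2, 4, 0],
})

# Donor positions for every row whose imp_flag is 0, taken in one step
# with a boolean mask instead of a row-by-row loop.
idxs = a.loc[a['imp_flag'] == 0, 'donor_index'].to_numpy()

# iloc selects rows by integer position.
b = a.iloc[idxs]
print(b)
```

Because the default RangeIndex is used here, positions and labels coincide, so `a.loc[idxs]` would select the same rows.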

Answer 2

IIUC, you want to sub-select the rows where imp_flag equals 1.

You can simply use query to match the relevant rows:

b = a.loc[a.query('imp_flag == 1')['donor_index']]

Alternatively, you can index with a boolean mask and select your data:

b = a.loc[a[a['imp_flag'] == 1]['donor_index']]

Output:

   var1  imp_flag  donor_index
3    88         1            4
4    34         1            0
0    33         1            3
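With the question's original data, donor_index contains NaN and is therefore stored as float. A minimal sketch of the same lookup for that case, assuming that rows without a donor should simply be skipped and that donor_index refers to index labels (the `dropna()` and `astype(int)` steps are my additions, not part of the answer above):

```python
import numpy as np
import pandas as pd

# Dummy data from the question: NaN makes donor_index a float column.
a = pd.DataFrame({
    'var1': [33, 75, 464, 88, 34],
    'imp_flag': [1, 0, 0, 1, 1],
    'donor_index': [3, np.nan, np.nan, 4, 0],
})

# Donor labels of the flagged rows; drop missing donors (assumption)
# and cast to int so they match the integer index labels.
donors = a.loc[a['imp_flag'] == 1, 'donor_index'].dropna().astype(int)

# Label-based lookup of all donor rows at once, with no iterrows loop.
b = a.loc[donors.to_numpy()]
print(b)
```

The rows come back in the order of the donor labels (3, 4, 0), which matches the output shown above.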

Pandas Iterrows: faster alternative?
