Question

I need to apply a function to a subset of the columns in a DataFrame. Consider the following toy example:

import pandas as pd
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']

What I would like to do is:

[pdf[c] = pdf[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]

But this is invalid syntax. Is it possible to accomplish this without a for loop?

Answer 1

Using mask:

pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]: 
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7

Or using assign:

pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]: 
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7
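
A minimal sketch of the same mask idea, assuming the goal is to write the result back into the selected columns only. The name `result` is introduced here for illustration and is not part of the original answer. Because only the subset is read and written, column c never needs to be restored:

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']

# Work on a copy so the original frame is left untouched
# (`result` is a hypothetical name used only in this sketch).
result = pdf.copy()

# mask() replaces values where the condition is True; only the
# selected columns are involved, so 'c' is never modified.
result[arb_cols] = result[arb_cols].mask(result[arb_cols] == 2, 99)

print(result)
```

Unlike the two expressions above, which return a new DataFrame, this variant assigns into an existing frame, which may be more convenient when the frame is updated step by step.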

Answer 2

Don't use pd.Series.apply when you can use vectorized functions instead.

For example, the following should be efficient for larger DataFrames, even though there is an outer loop over the columns:

for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99

Another option, using pd.DataFrame.replace:

pdf[arb_cols] = pdf[arb_cols].replace(2, 99)

Yet another option is numpy.where:

import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
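
As a quick sanity check, the three options above can be run side by side on copies of the toy frame. This is only a sketch: the names `via_loc`, `via_replace`, `via_where`, and `all_equal` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']

# Option 1: boolean indexing with .loc, one column at a time.
via_loc = pdf.copy()
for col in arb_cols:
    via_loc.loc[via_loc[col] == 2, col] = 99

# Option 2: DataFrame.replace on the column subset.
via_replace = pdf.copy()
via_replace[arb_cols] = via_replace[arb_cols].replace(2, 99)

# Option 3: numpy.where returns a plain 2-D array with the same shape
# as the subset, which is then assigned back to those columns.
via_where = pdf.copy()
via_where[arb_cols] = np.where(via_where[arb_cols] == 2, 99, via_where[arb_cols])

# Compare the underlying values of the three results.
all_equal = (
    np.array_equal(via_loc.to_numpy(), via_replace.to_numpy())
    and np.array_equal(via_replace.to_numpy(), via_where.to_numpy())
)
print(all_equal)
```

Each option works on its own copy, so the original `pdf` is not changed by the comparison.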

Answer 3

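For this case, if you need to apply a custom function, it is best to use applymap (note that in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):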

pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
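
A hedged sketch of the same element-wise approach that runs on both older and newer pandas versions. The helper name `replace_two` is hypothetical, and the version check assumes that DataFrame.map is present only in pandas 2.1 and later:

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']

def replace_two(x):
    """Hypothetical custom function: turn 2 into 99, keep other values."""
    return 99 if x == 2 else x

# pandas 2.1+ provides the element-wise method as DataFrame.map and
# deprecates applymap; older versions only have applymap.
if hasattr(pd.DataFrame, 'map'):
    pdf[arb_cols] = pdf[arb_cols].map(replace_two)
else:
    pdf[arb_cols] = pdf[arb_cols].applymap(replace_two)

print(pdf)
```

Both methods call the Python function once per element, so for a simple value substitution like this one the vectorized options in Answer 2 are likely to be faster on large frames.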