How to replace values in multiple columns based on a condition?
Suppose I have a df that looks like this:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
Using numpy, I can change the values of a single column based on a condition:
df['A'] = np.where((df['B'] < 5), '-', df['A'])
But how do I change the values of several columns based on a condition? I thought I could do the following, but it does not work.
df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])
I could write a loop, but that does not feel very pythonic / pandas-like:
cols = ['A', 'C']
for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])
Answer 0 (score: 3)
One idea is to use DataFrame.mask:
df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print(df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6
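For readers less familiar with `mask`: it replaces values where the condition is True, which makes it the mirror image of `DataFrame.where` (which keeps values where the condition is True). The short sketch below checks that the two agree once the condition is inverted. The names `cond`, `masked` and `kept` are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})
cond = df['B'] < 5  # boolean Series, one entry per row

# mask: replace where cond is True
masked = df[['A', 'C']].mask(cond, '-')
# where: keep where the (inverted) condition is True, replace elsewhere
kept = df[['A', 'C']].where(~cond, '-')

# Compare the two results cell by cell
same = masked.values.tolist() == kept.values.tolist()
print(same)
```

Which of the two reads better is mostly a matter of taste; `mask` avoids the extra `~` when the condition already describes the rows to overwrite.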
An alternative solution with DataFrame.loc:
df.loc[df['B'] < 5, ['A','C']] = '-'
print(df)
   A  C  B
0  -  -  3
1  -  -  4
2  3  3  6
3  4  4  6
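One side effect worth keeping in mind: writing the string '-' into integer columns forces those columns to the `object` dtype. As far as I know, recent pandas releases warn about this kind of implicit upcast in `.loc` assignment, so a cautious variant is to cast the target columns explicitly first. The sketch below shows that variant; the explicit `astype` step is my assumption and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})
cols = ['A', 'C']

# Assumption: cast the target columns to object up front so they can
# hold both integers and strings without an implicit dtype change.
df = df.astype({col: object for col in cols})

# Same assignment as in the answer, now on object columns
df.loc[df['B'] < 5, cols] = '-'

print(df.dtypes)
```

Column `B` stays an integer column, while `A` and `C` are `object` after the replacement, which also makes later numeric operations on them slower.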
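A solution using numpy.where and a broadcast mask (the mask is converted with to_numpy() first, since recent pandas versions no longer support indexing a Series with [:, None]):
df[['A','C']] = np.where((df['B'] < 5).to_numpy()[:, None], '-', df[['A', 'C']])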
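The reason the mask needs an extra trailing axis comes down to NumPy's broadcasting rules: a mask of shape (4,) is aligned against the last axis of the (4, 2) values, so 4 is compared with 2 and the shapes are incompatible, whereas a mask of shape (4, 1) is stretched across both columns. The following sketch makes the shapes visible. It uses 0 as the replacement to keep the result numeric, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})

mask_1d = (df['B'] < 5).to_numpy()   # shape (4,): one flag per row
mask_2d = mask_1d[:, None]           # shape (4, 1): a column vector
values = df[['A', 'C']].to_numpy()   # shape (4, 2)

# (4, 1) broadcasts against (4, 2), so each row flag applies to both columns
result = np.where(mask_2d, 0, values)

print(mask_1d.shape, mask_2d.shape, values.shape, result.shape)
```

This also explains why the attempt in the question fails: there the one-dimensional mask is passed directly, so the shapes cannot be broadcast together.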
Performance with mixed values (numbers together with strings):
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
# 400k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] = '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance when replacing with a number instead:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)
In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] = 0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
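The timings above depend on the machine and on the pandas and NumPy versions, so they are worth re-measuring locally. Below is a rough sketch using the standard library `timeit` module for the numeric case. The helper names (`make_df`, `with_mask`, `with_where`, `with_loc`, `benchmark`) are hypothetical, and the frame is built with `np.tile` rather than `pd.concat` purely to keep setup fast, which is an assumption on my part.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(repeats=100_000):
    # Same 4-row pattern as in the answer, repeated `repeats` times
    return pd.DataFrame({
        'A': np.tile([1, 2, 3, 4], repeats),
        'C': np.tile([1, 2, 3, 4], repeats),
        'B': np.tile([3, 4, 6, 6], repeats),
    })


def with_mask(df):
    df[['A', 'C']] = df[['A', 'C']].mask(df['B'] < 5, 0)


def with_where(df):
    mask = (df['B'] < 5).to_numpy()[:, None]
    df[['A', 'C']] = np.where(mask, 0, df[['A', 'C']])


def with_loc(df):
    df.loc[df['B'] < 5, ['A', 'C']] = 0


def benchmark(repeats=100_000, number=3):
    # Average seconds per call for each approach, each on a fresh frame
    results = {}
    for func in (with_mask, with_where, with_loc):
        df = make_df(repeats)
        total = timeit.timeit(lambda: func(df), number=number)
        results[func.__name__] = total / number
    return results


if __name__ == '__main__':
    for name, seconds in benchmark().items():
        print(f'{name}: {seconds * 1000:.1f} ms per call')
```

The absolute numbers will differ from the ones quoted above; the relative ordering of the three approaches is the part worth comparing.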