Pandas根据基于其他列的条件添加值

时间:2018-05-16 16:36:46

标签: python pandas

我有以下pandas数据帧:

enter image description here

import pandas as pd
import numpy as np

d = {'age' : [21, 45, 45, 5],
     'salary' : [20, 40, 10, 100]}

df = pd.DataFrame(d)

并想添加一个名为" is_rich"的额外列。如果一个人根据他/她的工资而富裕,则可以捕获。我找到了多种方法来实现这一目标:

# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')

# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'

导致:

enter image description here

但是,我不明白首选方式是什么。根据您的应用,所有方法都同样好吗?

1 个答案:

答案 0 :(得分:6)

使用timeits,Luke!

小型DataFrame

%timeit np.where(df['salary']>=50, 'yes', 'no')
%timeit ['yes' if x >= 50 else 'no' for x in df['salary']]
%timeit df['is_rich_method3'] = 'no'; df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'

281 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2.26 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

最佳方法 - 列表理解(最少开销)

大型DataFrame

df = pd.concat([df] * 10000, ignore_index=True)

%timeit np.where(df['salary']>=50, 'yes', 'no')
%timeit ['yes' if x >= 50 else 'no' for x in df['salary']]
%timeit df['is_rich_method3'] = 'no'; df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'

907 µs ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.92 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.67 ms ± 67.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

最佳方法 - np.where (矢量化,大型数组的高性能)

总之,方法的适用性取决于您的数据。但是,您对于如何在C中实现竞争性列表理解感到惊讶,并且它们在性能方面得到了高度优化。