我有以下pandas数据帧:
import pandas as pd
import numpy as np
d = {'age' : [21, 45, 45, 5],
'salary' : [20, 40, 10, 100]}
df = pd.DataFrame(d)
并想添加一个名为" is_rich"的额外列。如果一个人根据他/她的工资而富裕,则可以捕获。我找到了多种方法来实现这一目标:
# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')
# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]
# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'
导致:
但是,我不明白首选方式是什么。根据您的应用,所有方法都同样好吗?
答案 0 :(得分:6)
使用timeits
,Luke!
%timeit np.where(df['salary']>=50, 'yes', 'no')
%timeit ['yes' if x >= 50 else 'no' for x in df['salary']]
%timeit df['is_rich_method3'] = 'no'; df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'
281 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2.26 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
最佳方法 - 列表理解(最少开销)
df = pd.concat([df] * 10000, ignore_index=True)
%timeit np.where(df['salary']>=50, 'yes', 'no')
%timeit ['yes' if x >= 50 else 'no' for x in df['salary']]
%timeit df['is_rich_method3'] = 'no'; df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes'
907 µs ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.92 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.67 ms ± 67.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
最佳方法 - np.where
(矢量化,大型数组的高性能)
总之,方法的适用性取决于您的数据。但是,您对于如何在C中实现竞争性列表理解感到惊讶,并且它们在性能方面得到了高度优化。