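I want to calculate the mean age, excluding the value 99. In real life the DataFrame is much larger, and I also have other variables that may need the same treatment.
Is there a more efficient way (faster or more elegant) to do this? Perhaps with a pivot table, a groupby, or a function?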
import pandas as pd

data = {'age': [99, 45, 34, 32, 34, 67, 5, 6, 7, 8, 3, 5]}
df = pd.DataFrame(data, columns=['age'])
# Boolean mask: True for every row whose age is not 99
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
Answer 0 (score: 0)
A numpy solution is faster: first create the array, then filter it:
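# Extract the underlying numpy array, bypassing the pandas index machinery
arr = df['age'].values
# Boolean mask on the array, then mean of the remaining values
not99 = arr != 99
mean_for_age = arr[not99].mean()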
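However, if you need a general solution that can also select another column with the same mask, use your own solution:
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
# 'another col' is a placeholder for any other column present in your real DataFrame
mean_for_another_col = df.loc[not99, 'another col'].mean()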
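Since the question mentions other variables, here is a minimal sketch of two ways to reuse the same idea across columns. The column 'another col' and its values are made up for illustration and are not from the original data. The second variant swaps in a different technique: it treats 99 as a missing value with Series.mask, relying on mean() skipping NaN by default.

```python
import pandas as pd

# 'another col' and its values are hypothetical, added only for illustration
df = pd.DataFrame({
    'age': [99, 45, 34, 32, 34, 67, 5, 6, 7, 8, 3, 5],
    'another col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})

# Variant 1: one mask built from 'age', applied to several columns at once
not99 = df['age'] != 99
means_masked_rows = df.loc[not99, ['age', 'another col']].mean()

# Variant 2: replace the sentinel 99 with NaN in 'age' only;
# mean() skips NaN by default, so no row is dropped from other columns
mean_age_sentinel = df['age'].mask(df['age'] == 99).mean()

print(means_masked_rows)
print(mean_age_sentinel)
```

Variant 1 drops the whole row wherever age is 99, so the other column's mean also ignores that row. Variant 2 only blanks out the age value, which may be preferable when 99 is a "missing" code specific to one column.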
Timings (they depend on the data, so it is best to test on your real data):
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
df = pd.concat([df] * 10000, ignore_index=True)
In [14]: %%timeit
...: arr = df['age'].values
...: not99 = arr != 99
...:
...: mean_for_age = arr[not99].mean()
...:
496 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %%timeit
...: not99 = df['age'] != 99
...: mean_for_age = df.loc[not99, 'age'].mean()
...:
1.82 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %%timeit
...: df.query("age != 99")['age'].mean()
...:
4.26 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
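The %%timeit cells above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard-library timeit module. The function names below are my own, and the absolute numbers will differ from the ones shown above depending on machine, data, and library versions.

```python
import timeit

import numpy as np
import pandas as pd

data = {'age': [99, 45, 34, 32, 34, 67, 5, 6, 7, 8, 3, 5]}
df = pd.concat([pd.DataFrame(data)] * 10000, ignore_index=True)


def numpy_mean():
    # Filter on the raw numpy array
    arr = df['age'].to_numpy()
    return arr[arr != 99].mean()


def loc_mean():
    # Boolean mask with .loc
    return df.loc[df['age'] != 99, 'age'].mean()


def query_mean():
    # String expression with DataFrame.query
    return df.query("age != 99")['age'].mean()


candidates = {'numpy': numpy_mean, 'loc': loc_mean, 'query': query_mean}

# Sanity check: all three approaches must agree before timing them
results = {name: fn() for name, fn in candidates.items()}
assert all(np.isclose(value, results['numpy']) for value in results.values())

runs = 20
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```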