Imagine I have a DataFrame whose columns contain only real values.
>> df
        col1   col2      col3
0   0.907609     82  4.207991
1   3.743659   1523  6.488842
2   2.358696    324  5.092592
3   0.006793      0  0.000000
4  19.319746  11969  7.405685
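I want to group it by the quartiles (or any other percentiles I specify) of a chosen column, for example col1, so that I can perform some operations on those groups. Ideally, I would like to do something like:
df.groupby( quartiles_of_col1 ).mean() # not working, how to code quartiles_of_col1?
The output should give the mean of each column for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What is the simplest way to do it?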
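Answer 0 (score: 10)
I don't have a computer at hand to test this right now, but I think you can do it with:
df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
Will update in 150 minutes.
Some explanation (note that the columns are named col0, col1 and col2 in the session below):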
In [42]:
#use np.percentile to get the bin edges of any percentile you want
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
0.907609,
3.7436589999999996,
13.089311200000001,
19.319745999999999]
In [43]:
#Need to use include_lowest=True so that the minimum value falls into the first bin
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Otherwise the smallest value is skipped (the first bin is open on the left)
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
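For reference, here is a minimal Python 3 sketch of the same idea, applied to the frame from the question (columns col1, col2, col3) and using true quartile edges [0, 25, 50, 75, 100] instead of the percentiles above. The names edges, bins and means are only illustrative, and passing observed=False assumes a pandas version that supports that argument for categorical groupers.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Bin edges at the quartiles of col1; any other percentiles can be
# substituted here, e.g. [0, 25, 75, 90, 100].
edges = np.percentile(df["col1"], [0, 25, 50, 75, 100])

# include_lowest=True closes the first bin on the left, so the row holding
# the minimum of col1 is not dropped.
bins = pd.cut(df["col1"], edges, include_lowest=True)

# observed=False keeps bins that happen to be empty in the result.
means = df.groupby(bins, observed=False).mean()
print(means)
```

Each row of means is one quartile bin of col1, and each column holds the mean of that column within the bin.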
Answer 1 (score: 0)
I hope this solves your problem. It isn't pretty, but I hope it works for you.
import numpy as np
import pandas as pd

# Create a mock DataFrame with columns A, B, C and D as an example.
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Select the rows where A is below its 30th percentile (via the quantile
# method), then take the column means of that subset.
df[df['A'] < df['A'].quantile(0.3)].mean()
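This prints something like the following (the exact values differ between runs because the data are random):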
A -1.157615
B 0.205529
C -0.108263
D 0.346752
dtype: float64
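The mask above only covers a single group. A rough sketch of how the same boolean-mask idea could be repeated for every quartile band follows. The fixed seed, the band labels and the names probs, cuts and band_means are assumptions made for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

# Mock data as in the answer, with a fixed seed (an assumption, for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# Quantile levels that delimit the bands; change these for other percentiles.
probs = [0.0, 0.25, 0.5, 0.75, 1.0]
cuts = df["A"].quantile(probs).to_numpy()

band_means = {}
for i in range(len(cuts) - 1):
    lo, hi = cuts[i], cuts[i + 1]
    if i == 0:
        # Close the first band on the left so the minimum is not dropped.
        mask = (df["A"] >= lo) & (df["A"] <= hi)
    else:
        mask = (df["A"] > lo) & (df["A"] <= hi)
    label = f"{probs[i]:.2f}-{probs[i + 1]:.2f}"
    band_means[label] = df[mask].mean()

# One row per band, one column per original column.
result = pd.DataFrame(band_means).T
print(result)
```

This gives the same kind of table as a groupby, at the cost of an explicit loop over the bands.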
Answer 2 (score: 0)
pandas also has a native solution, pandas.qcut:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
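A minimal sketch of how pandas.qcut could be combined with groupby on the frame from the question. The labels Q1 to Q4 and the variable names are only illustrative, and observed=False assumes a pandas version that accepts that argument.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# q=4 asks for quartiles; a list of quantile levels such as
# [0, 0.25, 0.75, 0.9, 1] can be passed instead for custom percentiles
# (the number of labels must then match the number of bins).
quartile = pd.qcut(df["col1"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Group the rows by their quartile label and average every column.
means = df.groupby(quartile, observed=False).mean()
print(means)
```

Unlike pd.cut with hand-computed percentile edges, qcut derives the edges from the data itself, so no separate np.percentile call is needed.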