Groupby given percentiles of the values of a selected DataFrame column

Asked: 2014-07-09 15:02:28

Tags: python pandas group-by

Imagine I have a DataFrame whose columns contain only real values.

>> df        
          col1   col2      col3  
0     0.907609     82  4.207991 
1     3.743659   1523  6.488842 
2     2.358696    324  5.092592  
3     0.006793      0  0.000000  
4    19.319746  11969  7.405685 

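I want to group it by the quartiles (or any other percentiles I specify) of a chosen column (say, col1), so that I can perform some operations on those groups. Ideally, I would like to do something like: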

df.groupby( quartiles_of_col1 ).mean()  # not working, how to code quartiles_of_col1?

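The output should give the mean of each column for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What is the simplest way to achieve it?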

3 answers:

Answer 0 (score: 10)

I don't have a computer at hand to test this right now, but I think you can do it with: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update in 150 minutes.

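Some explanation (the columns here are named col0, col1 and col2, so col0 holds the values shown as col1 in the question):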

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
#Need to use include_lowest=True
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Otherwise the smallest value will be skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
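
The same idea can be sketched as a self-contained script for the quartile case the question asks about. This is only an illustration: it assumes the question's column names (col1, col2, col3), swaps in the percentile list [0, 25, 50, 75, 100], and passes observed=False so that empty bins stay in the result, which requires a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Bin edges at the quartiles of col1 (any other percentile list works too).
edges = np.percentile(df["col1"], [0, 25, 50, 75, 100])

# include_lowest=True keeps the minimum value inside the first bin.
bins = pd.cut(df["col1"], edges, include_lowest=True)

# observed=False keeps bins that contain no rows (they show up as NaN).
quartile_means = df.groupby(bins, observed=False).mean()
print(quartile_means)
```

With only five rows the four bins are not equally populated: the two smallest col1 values share the first bin, because the 25th percentile coincides with a data point and the intervals are closed on the right.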

Answer 1 (score: 0)

I hope this solves your problem. It isn't pretty, but I hope it works for you.

    import numpy as np
    import pandas as pd

    # Create a mock DataFrame with columns A, B, C and D.
    df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

    # Select the rows whose A value is below the 30th percentile of A
    # (via the quantile method), then average each column.
    print(df[df['A'] < df['A'].quantile(0.3)].mean())

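This prints something like the following (the exact numbers vary from run to run because the data is random):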

A   -1.157615
B    0.205529
C   -0.108263
D    0.346752
dtype: float64
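
The filter above picks out only the lowest group. As a rough sketch of how the same boolean-mask approach could select a band between two quantiles, the example below uses a seeded generator and a hypothetical 30th-to-60th percentile band; the seed, the band limits and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded mock data so the run is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# Hypothetical band: rows whose A value lies between the
# 30th and 60th percentiles of column A.
low = df["A"].quantile(0.3)
high = df["A"].quantile(0.6)
band = df[(df["A"] >= low) & (df["A"] < high)]

# Column means for the rows inside the band.
band_means = band.mean()
print(band_means)
```

Repeating this for every band works, but it needs one mask per group, which is why the binning approaches in the other answers are usually more convenient.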

Answer 2 (score: 0)

Pandas also has a native solution for this: pandas.qcut

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
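
A minimal sketch of how pandas.qcut might be combined with groupby, assuming the question's data and column names. Passing q=4 asks for quartiles, while a list of quantiles between 0 and 1 gives arbitrary percentile edges; observed=False is used so that empty bins are kept in the result.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# q=4 bins col1 by its quartiles; qcut computes the edges itself.
quartiles = pd.qcut(df["col1"], q=4)
quartile_means = df.groupby(quartiles, observed=False).mean()
print(quartile_means)

# Arbitrary percentiles: pass the quantile edges as fractions of 1.
custom = pd.qcut(df["col1"], q=[0, 0.25, 0.75, 0.9, 1.0])
custom_sizes = df.groupby(custom, observed=False).size()
print(custom_sizes)
```

Compared with pd.cut plus np.percentile, qcut includes the lowest value in the first bin by default, so no include_lowest argument is needed.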