Question

Imagine I have a DataFrame whose columns contain only real values.

>> df        
          col1   col2      col3  
0     0.907609     82  4.207991 
1     3.743659   1523  6.488842 
2     2.358696    324  5.092592  
3     0.006793      0  0.000000  
4    19.319746  11969  7.405685

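I would like to group the rows by the quartiles (or any other percentiles I specify) of a chosen column, e.g. col1, and then perform some operations on those groups. Ideally I would like to do something like: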

df.groupby( quartiles_of_col1 ).mean()  # not working, how to code quartiles_of_col1?

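The output should give the mean of each column for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What is the simplest way to do it?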

Answer 1

I don't have a computer at hand to test this right now, but I think you can do it with: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). Will update in about 150 minutes.

Some explanation:

In [42]:
#use np.percentile to get the bin edges of any percentile you want 
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
#Need to use include_lowest=True so the minimum value lands in the first bin
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
#Otherwise the smallest value will be skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
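
The effect of include_lowest can also be checked directly on the five rows from the question. The following is a minimal sketch for Python 3 and a recent pandas, assuming plain quartile edges [0, 25, 50, 75, 100] and the question's column name col1 rather than the col0 used above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The five rows shown in the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Quartile edges of col1 (assumed percentiles; swap in any others you need).
edges = np.percentile(df["col1"], [0, 25, 50, 75, 100])

# Bins are right-closed, so the minimum sits on the open left edge of the first bin.
without_lowest = pd.cut(df["col1"], edges)
with_lowest = pd.cut(df["col1"], edges, include_lowest=True)

print(without_lowest.isna().sum())  # rows left unbinned without include_lowest
print(with_lowest.isna().sum())     # rows left unbinned with include_lowest

# Mean of every column per bin; observed=False keeps empty bins in the result.
means = df.groupby(with_lowest, observed=False).mean()
print(means)
```

Rows whose bin is NaN are silently dropped by groupby, which is why the first group's mean changes between the two outputs above.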

Answer 2

I hope this solves your problem. It isn't pretty, but I hope it works for you.

    import pandas as pd
    import numpy as np

    ## create a mock df as an example, with columns A, B, C and D
    df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

    ## select the rows where column A is below its 30th percentile,
    ## using the quantile method, then take the mean of each column
    print(df[df['A'] < df['A'].quantile(0.3)].mean())

This prints something like the following (the exact numbers vary because the data is random):

A   -1.157615
B    0.205529
C   -0.108263
D    0.346752
dtype: float64
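
Since the frame above is random, its output cannot be reproduced exactly. A seeded variant makes it easier to see what the mask selects. This is only a sketch: the use of np.random.default_rng and the seed 0 are my assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# 30th percentile of column A and the rows strictly below it.
threshold = df["A"].quantile(0.3)
mask = df["A"] < threshold

print(mask.sum())       # number of rows below the threshold
print(df[mask].mean())  # column means for that subset only
```

This handles one group at a time. The pd.cut and pandas.qcut answers cover all groups in a single groupby.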

Answer 3

Pandas also has a native solution, pandas.qcut:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
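
A minimal sketch of how pandas.qcut might be applied to the data from the question, assuming four equal-frequency bins. The labels Q1 to Q4 are my own naming, not from the question.

```python
import pandas as pd

# The five rows shown in the question.
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# qcut computes the quantile edges itself and includes the minimum value.
quartile = pd.qcut(df["col1"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean of every column within each quartile of col1.
result = df.groupby(quartile, observed=True).mean()
print(result)
```

For other percentiles, q also accepts a list of quantiles such as [0, 0.25, 0.75, 0.9, 1], in which case the labels list needs one entry per resulting bin.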
