Question

I have a Pandas DataFrame that looks like this:

 school_id  uni_id  points
 123        44      180
 123        45      160
 123        45      160
 123        48      110
 124        44      180
 124        45      160
 124        47      130
 124        48      120

It can be generated as follows, to help kind answerers:

import pandas as pd

df = pd.DataFrame({
    'school_id': [123, 123, 123, 123, 124, 124, 124, 124],
    'uni_id': [44, 45, 45, 48, 44, 45, 47, 48],
    'points': [180, 160, 160, 110, 180, 160, 130, 120]
})

I want to add a percentile column giving the percentile of each points value within its school. The dataset would then look like this:

 school_id  uni_id  points  percentile
 123        44      180     100
 123        45      160     50
 123        45      160     50
 123        48      110     0
 124        44      180     100
 124        45      160     66
 124        47      130     33
 124        48      120     0

What is the best way to do this? I assume I need to group by school_id, then somehow run df.quantile() within each subgroup, and then ungroup?

Update: maybe I need to start with something like df.groupby('school_id')['points'].rank(ascending=False) and then divide each rank by the size of its group to normalize it to between 0 and 100?

Answer 1

You can pass pct=True as an additional argument to the GroupBy.rank method to compute the percentile rank of the numeric data within each "school_id" subgroup:

df.assign(percentile=df.groupby("school_id")['points'].rank(pct=True).mul(100))
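
For reference, here is a minimal, self-contained sketch of what that one-liner gives on the sample data. It assumes the second column in the question's constructor was meant to be uni_id. With pct=True the default average rank is divided by the group size, so tied scores share a value and the lowest score in a group does not come out as 0:

```python
import pandas as pd

# Sample data from the question (assumption: the second column is uni_id).
df = pd.DataFrame({
    "school_id": [123, 123, 123, 123, 124, 124, 124, 124],
    "uni_id": [44, 45, 45, 48, 44, 45, 47, 48],
    "points": [180, 160, 160, 110, 180, 160, 130, 120],
})

# Rank points within each school; pct=True scales the average rank to (0, 1].
pct_rank = df.groupby("school_id")["points"].rank(pct=True)
result = df.assign(percentile=pct_rank.mul(100))

print(result["percentile"].tolist())
# [100.0, 62.5, 62.5, 25.0, 100.0, 75.0, 50.0, 25.0]
```

These values are not the 100/50/50/0 pattern the question asks for; that needs a dense, zero-based ranking instead.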

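To check (for one instance, the score 160). Note that the output below counts the row with uni_id 48 and points 120 under school 123 rather than 124, so it differs from what the DataFrame constructed in the question gives: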

from scipy.stats import percentileofscore
df.groupby("school_id")['points'].apply(percentileofscore, 160)

school_id
123    70.000000
124    66.666667
Name: points, dtype: float64
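
If you would rather check every row without scipy, something like the following should work. This is only a sketch: avg_rank_percentile is a hypothetical helper written for this check, not a library function, and it assumes the average-rank definition of a percentile (ties get the mean of their ranks, divided by the group size).

```python
import pandas as pd

# Sample data from the question (assumption: the second column is uni_id).
df = pd.DataFrame({
    "school_id": [123, 123, 123, 123, 124, 124, 124, 124],
    "uni_id": [44, 45, 45, 48, 44, 45, 47, 48],
    "points": [180, 160, 160, 110, 180, 160, 130, 120],
})

def avg_rank_percentile(values, score):
    """Hypothetical helper: average rank of `score` in `values`, as a percentage."""
    below = (values < score).sum()   # values strictly smaller than the score
    equal = (values == score).sum()  # values tied with the score
    # Tied values occupy ranks below+1 .. below+equal; their mean is below + (equal + 1) / 2.
    return (below + (equal + 1) / 2) / len(values) * 100

grouped = df.groupby("school_id")["points"]

# Percentile of every row computed by hand, group by group.
by_hand = grouped.transform(lambda g: g.map(lambda s: avg_rank_percentile(g, s)))

# The same thing via rank(pct=True).
ranked = grouped.rank(pct=True).mul(100)

matches = bool(((by_hand - ranked).abs() < 1e-9).all())
print(matches)
```

If the two disagree, the percentile definition you have in mind is probably not the average-rank one.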

Answer 2

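You want to do a couple of things here.

You want your ranking to be dense
You want the lowest to be zero and the highest to be 100. I call this an inclusive ranking

I wrote a separate function to apply.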

def dense_inclusive_pct(x):
    # I subtract one to handle the inclusive bit
    r = x.rank(method='dense') - 1
    return r / r.max() * 100

df.assign(pct=df.groupby('school_id').points.transform(dense_inclusive_pct).astype(int))

   points  school_id  uni_id  pct
0     180        123      44  100
1     160        123      45   50
2     160        123      45   50
3     110        123      48    0
4     180        124      44  100
5     160        124      45   66
6     130        124      47   33
7     120        124      48    0
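
One edge case to be aware of: if a school has only one distinct points value, r.max() is 0 and the division yields NaN, which astype(int) cannot convert. Below is a sketch of a guarded variant. Returning 100 for such a group is my own assumption, so pick whatever value makes sense for your data.

```python
import pandas as pd

# Sample data from the question (assumption: the second column is uni_id).
df = pd.DataFrame({
    "school_id": [123, 123, 123, 123, 124, 124, 124, 124],
    "uni_id": [44, 45, 45, 48, 44, 45, 47, 48],
    "points": [180, 160, 160, 110, 180, 160, 130, 120],
})

def dense_inclusive_pct(x):
    # Dense ranks start at 1; subtract one so the lowest value maps to 0.
    r = x.rank(method="dense") - 1
    top = r.max()
    if top == 0:
        # Assumption: a group with a single distinct value gets 100 everywhere.
        return pd.Series(100.0, index=x.index)
    return r / top * 100

# transform keeps the original row index, so the result aligns with df.
pct = df.groupby("school_id")["points"].transform(dense_inclusive_pct)
result = df.assign(pct=pct.astype(int))  # astype(int) truncates, e.g. 66.67 -> 66

print(result["pct"].tolist())
# [100, 50, 50, 0, 100, 66, 33, 0]
```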

Pandas: calculate percentile within subgroups?
