Question

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame([[1,2],
                   [1,4],
                   [1,5],
                   [2,65],
                   [2,34],
                   [2,23],
                   [2,45]], columns = ['label', 'score'])

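Is there an efficient way to create a column score_winsor that winsorizes the score column within each label group at the 1% level?

I tried this, but it did not work: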

df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))

Answer 1

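You can use scipy's implementation of winsorize (scipy.stats.mstats.winsorize):

from scipy.stats.mstats import winsorize

# Winsorize each label group separately, 1% on each tail
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))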

Output

>>> df
   label  score  score_winsor
0      1      2             2
1      1      4             4
2      1      5             5
3      2     65            65
4      2     34            34
5      2     23            23
6      2     45            45
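The scores above come back unchanged, which seems to follow from how scipy counts the values to replace: with the default inclusive=(True, True), it clips int(limit * n) values on each side, and int(0.01 * 3) is 0 for a group of three. A minimal sketch on the label 1 scores, where the 0.34 limit is only an illustrative value (not from the question) chosen so that one value per side gets replaced:

```python
import numpy as np
from scipy.stats.mstats import winsorize

scores = np.array([2, 4, 5])  # the scores for label 1

# int(0.01 * 3) == 0, so no value is replaced on either side.
unchanged = np.asarray(winsorize(scores, limits=[0.01, 0.01]))

# int(0.34 * 3) == 1, so the smallest and the largest value are each
# replaced by their nearest remaining neighbour (illustrative limit).
clipped = np.asarray(winsorize(scores, limits=[0.34, 0.34]))

print(unchanged)  # [2 4 5]
print(clipped)    # [4 4 4]
```

So on groups this small, a 1% limit is effectively a no-op with this function; it only starts replacing values once a group has at least 100 rows.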

Answer 2

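This works:

import numpy as np

# np.minimum / np.maximum operate element-wise, unlike the built-in min / max
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))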

Output

print(df.to_string())

   label  score  score_winsor
0      1      2          2.04
1      1      4          4.00
2      1      5          4.98
3      2     65         64.40
4      2     34         34.00
5      2     23         23.33
6      2     45         45.00
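The attempt in the question most likely fails because Python's built-in min and max try to compare a whole Series with a scalar and need a single truth value, which pandas refuses to give. As a sketch of an equivalent alternative, Series.clip bounds every value by the group's own quantiles; the helper name clip_to_quantiles is hypothetical, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [1, 4], [1, 5], [2, 65], [2, 34], [2, 23], [2, 45]],
    columns=["label", "score"],
)

def clip_to_quantiles(x, lower=0.01, upper=0.99):
    # Hypothetical helper: bound each value in the group by the group's
    # own lower and upper quantiles (linear interpolation by default).
    return x.clip(lower=x.quantile(lower), upper=x.quantile(upper))

df["score_winsor"] = df.groupby("label")["score"].transform(clip_to_quantiles)
print(df.to_string())
```

This should give the same column as the np.maximum / np.minimum version. Note that both clip to interpolated quantiles, so the result can contain values such as 2.04 that never occur in the data, whereas scipy's winsorize only ever substitutes existing observations.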

Winsorize within groups of a dataframe

2 answers: