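I have a DataFrame of portfolios with risk figures. I want to group by the "Port" column of the DataFrame below, and then replace every value in the "Risk" column that is greater than its group's 95th percentile with the median of that portfolio group.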
df =
Date Port Risk
2019-04-30 a 21.8
2019-03-29 a 22.6
2019-02-28 a 500
2019-01-31 a 26.1
2019-04-30 b 36.4
2019-03-29 b 43.3
2019-02-28 b 40
2019-01-31 b 364
I tried the following code, which I found on Stack Overflow, but it does not work:
q = pd.DataFrame(df.groupby('Port')['Risk'].quantile(0.95))
df.loc[(((q.loc[df.Port,'Risk']<df['Risk'].values)))]=q.loc[df.Port,'Risk']
I also tried:
def replace(group):
    q = group.quantile(0.95)
    outlier = group>q
    group[outlier] = group.median()
    return group
df.groupby('Port').transform(replace)
The expected result is that the third record of port "a" is replaced with the median of group "a", 22.2, and the fourth record of port "b" is replaced with the median of group "b", 41.6.
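Answer 0 (score: 2)
The medians seem to differ slightly from the ones you stated (see the comment in the output DataFrame). Here is an approach that combines GroupBy.transform with where: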
g = df.groupby('Port').Risk
df['Risk'] = (df.Risk.where(g.transform('quantile', q=0.95) > df.Risk,
g.transform('median')))
Date Port Risk
0 2019-04-30 a 21.80
1 2019-03-29 a 22.60
2 2019-02-28 a 24.35 # -> np.median([21.8, 22.6, 500, 26.1]) = 24.35
3 2019-01-31 a 26.10
4 2019-04-30 b 36.40
5 2019-03-29 b 43.30
6 2019-02-28 b 40.00
7 2019-01-31 b 41.65
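To check the medians mentioned above, the per-group statistics can be computed directly. This is only a minimal sketch that assumes the sample data from the question; the name `stats` and the column labels `median` and `q95` are illustrative, and the quantile uses pandas' default linear interpolation.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["2019-04-30", "2019-03-29", "2019-02-28", "2019-01-31"] * 2,
    "Port": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "Risk": [21.8, 22.6, 500, 26.1, 36.4, 43.3, 40, 364],
})

# One row per portfolio: the group median and the group's 95th percentile.
stats = df.groupby("Port")["Risk"].agg(
    median="median",
    q95=lambda s: s.quantile(0.95),
)
print(stats)
```

With four values per group the median is the mean of the two middle values, which gives 24.35 for "a" and 41.65 for "b". Only 500 and 364 lie above their group's 95th percentile, so those are the only rows that get replaced.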
Answer 1 (score: 2)
Sticking with the code you posted:
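def replace(group):
    # Work on a copy so the caller's data is not modified in place.
    group = group.copy()
    q = group.quantile(0.95)
    outlier = group > q
    group[outlier] = group.median()
    return group
# Select 'Risk' before transforming; otherwise replace() is also applied to 'Date'.
df['Risk'] = df.groupby('Port')['Risk'].transform(replace)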
print(df)
Output:
Date Port Risk
0 2019-04-30 a 21.80
1 2019-03-29 a 22.60
2 2019-02-28 a 24.35
3 2019-01-31 a 26.10
4 2019-04-30 b 36.40
5 2019-03-29 b 43.30
6 2019-02-28 b 40.00
7 2019-01-31 b 41.65
Answer 2 (score: 1)
Here is one way to do it:
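import pandas as pd

df = pd.DataFrame({"Port": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Risk": [21.8, 22.6, 500, 26.1, 36.4, 43.3, 40, 364]})

for port in df['Port'].unique():
    mask_port = df['Port'] == port
    # Restrict to the numeric 'Risk' column; the string column 'Port' has no quantile or median.
    quantile_port = df.loc[mask_port, 'Risk'].quantile(0.95)
    median_port = df.loc[mask_port, 'Risk'].median()
    # Replace only this group's values that exceed its 95th percentile.
    df.loc[mask_port & (df['Risk'] > quantile_port), 'Risk'] = median_port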
In [1] : print(df)
Out[1] : Port Risk
0 a 21.80
1 a 22.60
2 a 24.35
3 a 26.10
4 b 36.40
5 b 43.30
6 b 40.00
7 b 41.65
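If the same replacement is needed in several places, the loop can be folded into a small helper. The sketch below is one possible way to do that and is not taken from any of the answers: the function name `replace_outliers_with_median` and its parameters are hypothetical, and it uses a strict `>` comparison to match the wording of the question.

```python
import pandas as pd


def replace_outliers_with_median(df, group_col="Port", value_col="Risk", q=0.95):
    """Return a copy of df where values above their group's q-quantile
    are replaced by that group's median. (Hypothetical helper.)"""
    grouped = df.groupby(group_col)[value_col]
    # Broadcast each group's threshold and median back to the original rows.
    threshold = grouped.transform(lambda s: s.quantile(q))
    median = grouped.transform("median")
    out = df.copy()
    # mask() replaces the values where the condition is True.
    out[value_col] = df[value_col].mask(df[value_col] > threshold, median)
    return out


df = pd.DataFrame({
    "Port": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "Risk": [21.8, 22.6, 500, 26.1, 36.4, 43.3, 40, 364],
})
result = replace_outliers_with_median(df)
print(result)
```

Because the helper returns a copy, the original `df` keeps its outliers, which makes it easy to compare the data before and after.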