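I have a DataFrame with two columns, `score` and `order_amount` (named `value` in the example below). I want to find the score Y that marks the X-th percentile of `order_amount`: that is, if I sum `order_amount` over all rows where `score <= Y`, I get X% of the total `order_amount`.
I have a working solution below, but it seems like pandas should offer a more elegant way.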
import pandas as pd
test_data = {'score': [0.3,0.1,0.2,0.4,0.8],
'value': [10,100,15,200,150]
}
df = pd.DataFrame(test_data)
df
   score  value
0    0.3     10
1    0.1    100
2    0.2     15
3    0.4    200
4    0.8    150
# Now we can order by `score` and use `cumsum` to calculate what we want
df_order = df.sort_values('score')
df_order['percentile_value'] = 100*df_order['value'].cumsum()/df_order['value'].sum()
df_order
   score  value  percentile_value
1    0.1    100         21.052632
2    0.2     15         24.210526
0    0.3     10         26.315789
3    0.4    200         68.421053
4    0.8    150        100.000000
# Now we can find the first score whose cumulative percentage exceeds 50% (for example)
df_order[df_order['percentile_value']>50]['score'].iloc[0]
Answer 0 (score: 3)
# percentile_value is sorted ascending, so a binary search finds the first position >= 50
idx = df_order['percentile_value'].searchsorted(50)
print(df_order.iloc[idx, df_order.columns.get_loc('score')])
0.4
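A note on why this works: `searchsorted` does a binary search, so it assumes `percentile_value` is sorted ascending, which holds here because it is a cumulative sum of non-negative values. The sketch below rebuilds the question's data so it runs on its own and shows how the `side` argument decides whether a row sitting exactly on the threshold is picked. The threshold of 100 is chosen only to hit that edge case.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'score': [0.3, 0.1, 0.2, 0.4, 0.8],
                   'value': [10, 100, 15, 200, 150]})
df_order = df.sort_values('score')
df_order['percentile_value'] = 100 * df_order['value'].cumsum() / df_order['value'].sum()

pct = df_order['percentile_value']

# side='left' (the default): first position where pct >= threshold
idx_left = int(pct.searchsorted(100, side='left'))
# side='right': first position where pct > threshold
idx_right = int(pct.searchsorted(100, side='right'))

print(idx_left, idx_right, len(df_order))  # 4 5 5
```

A result equal to `len(df_order)` means no row qualifies, so it is worth checking the index before passing it to `iloc`.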
Or, to return a default value when nothing matches, use `next` and `iter` to get the first value of the filtered Series:
s = df_order.loc[df_order['percentile_value'] > 50, 'score']
print (next(iter(s), 'no match'))
0.4
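To see the default actually kick in, here is a small sketch with a threshold that no row can exceed (100 is picked purely for illustration). It rebuilds the question's data so it can be run standalone.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'score': [0.3, 0.1, 0.2, 0.4, 0.8],
                   'value': [10, 100, 15, 200, 150]})
df_order = df.sort_values('score')
df_order['percentile_value'] = 100 * df_order['value'].cumsum() / df_order['value'].sum()

# No cumulative percentage is greater than 100, so the filtered Series is empty
s_empty = df_order.loc[df_order['percentile_value'] > 100, 'score']

# next() falls back to the default instead of raising
result = next(iter(s_empty), 'no match')
print(result)  # no match

# By contrast, .iloc[0] on the empty Series raises IndexError
try:
    s_empty.iloc[0]
    raised = False
except IndexError:
    raised = True
```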
One-line solution:
out = next(iter(df.sort_values('score')
                  .assign(percentile_value=lambda x: 100 * x['value'].cumsum() / x['value'].sum())
                  .query('percentile_value > 50')['score']), 'no match')
print (out)
0.4
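If this is needed for several percentiles, the same steps could be wrapped in a small helper. This is only a sketch: the function name `score_at_percentile` and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

def score_at_percentile(df, pct, score_col='score', value_col='value', default=None):
    """Return the first score whose cumulative share of value_col exceeds pct percent.

    Hypothetical helper: sorts by score, builds the cumulative percentage,
    and returns `default` when no row exceeds pct.
    """
    ordered = df.sort_values(score_col)
    share = 100 * ordered[value_col].cumsum() / ordered[value_col].sum()
    matches = ordered.loc[share > pct, score_col]
    return next(iter(matches), default)

# Same data as in the question
df = pd.DataFrame({'score': [0.3, 0.1, 0.2, 0.4, 0.8],
                   'value': [10, 100, 15, 200, 150]})

print(score_at_percentile(df, 50))   # 0.4
print(score_at_percentile(df, 25))   # 0.3
print(score_at_percentile(df, 100))  # None
```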
Answer 1 (score: 2)
Here is another approach using `np.percentile`, starting from the original DataFrame:
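import numpy as np

df = df.sort_values('score')
# Two equivalent spellings: np.percentile(..., 50) and Series.quantile(0.5) both take the median of the running totals
df.loc[np.searchsorted(df['value'], np.percentile(df['value'].cumsum(), 50)), 'score']
df.loc[np.searchsorted(df['value'], df['value'].cumsum().quantile(0.5)), 'score']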
If the index is not the default one, do the same with `iloc`:
df.iloc[np.searchsorted(df['value'], np.percentile(df['value'].cumsum(), 50)),
        df.columns.get_loc('score')]
0.4
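One caveat worth spelling out: `np.percentile` of the cumulative sum is the median of the running totals, which is not in general the same number as 50% of the grand total, and `np.searchsorted` expects the array it searches to be sorted. Both happen to lead to 0.4 on this data. A variant that searches the cumulative sum itself for a fraction of the total might look like the sketch below; the variable names are illustrative, and `side='right'` is chosen to mirror the question's strict `> 50` filter.

```python
import numpy as np
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'score': [0.3, 0.1, 0.2, 0.4, 0.8],
                   'value': [10, 100, 15, 200, 150]})
df_sorted = df.sort_values('score')

# Running totals in score order: 100, 115, 125, 325, 475 (sorted ascending)
cum = df_sorted['value'].cumsum().to_numpy()

median_of_cum = np.percentile(cum, 50)  # median of the running totals
target = 0.5 * cum[-1]                  # 50% of the grand total

# First position whose running total is strictly greater than the target
pos = np.searchsorted(cum, target, side='right')
score = df_sorted['score'].iloc[pos]

print(median_of_cum, target, score)  # 125.0 237.5 0.4
```

Using `iloc` here keeps the lookup positional, so it does not depend on what the index labels look like after sorting.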