Consider the following DataFrame:
import numpy as np
import pandas as pd

tdf = pd.DataFrame({
    'City': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'PRJ': ['A', 'B', 'C', 'D', 'E', 'F', 'GG', 'GG', 'I', 'J', 'K'],
    'Year': [2011, 2012, 2013, 2014, 2015, 2011, 2012, 2012, 2013, 2014, 2015],
    'EXPECTED': [2, 3, 4, 6.1, 7, 7.1, 8, 3, 10, 11, 11],
    'ACTUAL': [0.5, 1.8, 2.7, 5.1, 5.8, 6.8, 10, 10, 8, 8.1, 8.2],
})
My goal is to add a column ratio = actual / expected.
If I did not have project GG, this would be trivial:
tdf['Ratio']=tdf['ACTUAL']/tdf['EXPECTED']
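Project GG appears twice for the same city and year, and both rows carry the same ACTUAL value. Given this, what I want to do is add another column, ACTUAL_ADJUSTED, in which ACTUAL is split across the rows in proportion to EXPECTED, like this: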
prj_ratio = 10/(8+3) = 0.909
gg6_actual = (0.909*8)=7.272
gg7_actual = (0.909*3)=2.727
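The arithmetic above can be checked in a few lines of plain Python. This is only a sketch of the intended calculation; the names `expected`, `actual_total`, and `adjusted` are illustrative and do not appear in the original post.

```python
# EXPECTED values of the two GG rows and the ACTUAL value they share.
expected = [8, 3]
actual_total = 10

# Scale factor that spreads the shared ACTUAL across both rows.
prj_ratio = actual_total / sum(expected)

# Each row receives a share proportional to its EXPECTED value.
adjusted = [prj_ratio * e for e in expected]

print(prj_ratio)
print(adjusted)
```

The adjusted shares add back up to the original ACTUAL of 10, which is a quick sanity check for the allocation.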
What have I tried? I wrote a function:
def make_adjustments(r):
    # Select every row that shares this row's City and Year.
    s = tdf[(tdf['City'] == r['City']) & (tdf['Year'] == r['Year'])]
    if len(s) > 1:
        return "problem here"
    else:
        return 'ok'

tdf['ACTUAL_ADJUSTED'] = tdf.apply(make_adjustments, axis=1)
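This function identifies the problem rows, but on my real data it takes a very long time, since it filters the whole DataFrame once per row. So I conclude I am on the wrong track. Any ideas on how to solve this?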
Answer 0 (score: 1)
Use transform with nunique, then np.where:
s=tdf.groupby(['City','PRJ'])['EXPECTED'].transform('nunique')
s1=tdf.groupby(['City','PRJ'])['EXPECTED'].transform('sum')
tdf['ACTUAL_ADJUSTED']=np.where(s>1,'problem here','ok')
tdf['value']=np.where(s==1,tdf.ACTUAL/tdf.EXPECTED,tdf.ACTUAL/s1*tdf.EXPECTED)
tdf
Out[728]:
City PRJ Year EXPECTED ACTUAL Ratio ACTUAL_ADJUSTED value
0 NY A 2011 2.0 0.5 0.250000 ok 0.250000
1 NY B 2012 3.0 1.8 0.600000 ok 0.600000
2 NY C 2013 4.0 2.7 0.675000 ok 0.675000
3 NY D 2014 6.1 5.1 0.836066 ok 0.836066
4 NY E 2015 7.0 5.8 0.828571 ok 0.828571
5 CA F 2011 7.1 6.8 0.957746 ok 0.957746
6 CA GG 2012 8.0 10.0 1.250000 problem here 7.272727
7 CA GG 2012 3.0 10.0 3.333333 problem here 2.727273
8 CA I 2013 10.0 8.0 0.800000 ok 0.800000
9 CA J 2014 11.0 8.1 0.736364 ok 0.736364
10 CA K 2015 11.0 8.2 0.745455 ok 0.745455
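If only the flag is needed, another option is `DataFrame.duplicated` with `keep=False`, which marks every row whose key occurs more than once. The sketch below assumes duplicates are defined by (City, Year), as in the question's function, rather than by (City, PRJ) as above; the name `is_dup` is illustrative.

```python
import numpy as np
import pandas as pd

tdf = pd.DataFrame({
    'City': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'PRJ': ['A', 'B', 'C', 'D', 'E', 'F', 'GG', 'GG', 'I', 'J', 'K'],
    'Year': [2011, 2012, 2013, 2014, 2015, 2011, 2012, 2012, 2013, 2014, 2015],
    'EXPECTED': [2, 3, 4, 6.1, 7, 7.1, 8, 3, 10, 11, 11],
    'ACTUAL': [0.5, 1.8, 2.7, 5.1, 5.8, 6.8, 10, 10, 8, 8.1, 8.2],
})

# True for every row whose (City, Year) pair occurs more than once.
# keep=False marks all members of a duplicated group, not just the repeats.
is_dup = tdf.duplicated(subset=['City', 'Year'], keep=False)

# Same labels as the question's function, computed in one vectorized pass.
tdf['ACTUAL_ADJUSTED'] = np.where(is_dup, 'problem here', 'ok')
print(tdf)
```

This avoids filtering the whole frame once per row, which is what makes the row-wise `apply` in the question slow.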
Answer 1 (score: 1)
Try:
def adjust(x):
    if len(x) == 1:
        # Single row for this City/Year: plain ratio.
        return x['ACTUAL'] / x['EXPECTED']
    else:
        # Several rows: split ACTUAL in proportion to EXPECTED.
        return x['ACTUAL'] * x['EXPECTED'] / x['EXPECTED'].sum()

tdf['RATIO'] = (tdf.groupby(['City', 'Year'])
                .apply(adjust)
                .reset_index(level=[0, 1], drop=True)
                )
Output:
+-----+-------+------+-------+-----------+---------+----------+
| | City | PRJ | Year | EXPECTED | ACTUAL | RATIO |
+-----+-------+------+-------+-----------+---------+----------+
| 0 | NY | A | 2011 | 2.0 | 0.5 | 0.250000 |
| 1 | NY | B | 2012 | 3.0 | 1.8 | 0.600000 |
| 2 | NY | C | 2013 | 4.0 | 2.7 | 0.675000 |
| 3 | NY | D | 2014 | 6.1 | 5.1 | 0.836066 |
| 4 | NY | E | 2015 | 7.0 | 5.8 | 0.828571 |
| 5 | CA | F | 2011 | 7.1 | 6.8 | 0.957746 |
| 6 | CA | GG | 2012 | 8.0 | 10.0 | 7.272727 |
| 7 | CA | GG | 2012 | 3.0 | 10.0 | 2.727273 |
| 8 | CA | I | 2013 | 10.0 | 8.0 | 0.800000 |
| 9 | CA | J | 2014 | 11.0 | 8.1 | 0.736364 |
| 10 | CA | K | 2015 | 11.0 | 8.2 | 0.745455 |
+-----+-------+------+-------+-----------+---------+----------+
Or, if you want the 'ACTUAL_ADJUSTED' column from your example:
tdf['ACTUAL_ADJUSTED'] = (tdf.groupby(['City', 'Year'])
.ACTUAL.transform(lambda x:
'OK' if len(x)==1
else 'problem here')
)
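For a numeric ACTUAL_ADJUSTED as described in the question, a minimal sketch follows. It assumes duplicates are defined by (City, Year) and that the shared ACTUAL value is repeated on each duplicated row, as it is for GG. Under those assumptions no branching is needed: for a group with a single row, ACTUAL * EXPECTED / sum(EXPECTED) is just ACTUAL. The name `group_total` is illustrative.

```python
import pandas as pd

tdf = pd.DataFrame({
    'City': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'PRJ': ['A', 'B', 'C', 'D', 'E', 'F', 'GG', 'GG', 'I', 'J', 'K'],
    'Year': [2011, 2012, 2013, 2014, 2015, 2011, 2012, 2012, 2013, 2014, 2015],
    'EXPECTED': [2, 3, 4, 6.1, 7, 7.1, 8, 3, 10, 11, 11],
    'ACTUAL': [0.5, 1.8, 2.7, 5.1, 5.8, 6.8, 10, 10, 8, 8.1, 8.2],
})

# Sum of EXPECTED within each (City, Year) group, aligned to the original rows.
group_total = tdf.groupby(['City', 'Year'])['EXPECTED'].transform('sum')

# Split ACTUAL in proportion to each row's share of the group's EXPECTED.
tdf['ACTUAL_ADJUSTED'] = tdf['ACTUAL'] * tdf['EXPECTED'] / group_total

# Ratio based on the adjusted actual; both GG rows get the same value.
tdf['Ratio'] = tdf['ACTUAL_ADJUSTED'] / tdf['EXPECTED']
print(tdf)
```

Keeping the adjusted actual and the ratio in separate columns avoids mixing ratios and amounts in one column, and `transform('sum')` keeps everything vectorized.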