How to adjust a ratio when other rows share the same attributes

Time: 2019-05-10 21:12:15

Tags: python pandas dataframe

Consider the following dataframe:

import pandas as pd
tdf=pd.DataFrame({'City':['NY','NY','NY','NY','NY','CA','CA','CA','CA','CA','CA'],'PRJ':['A','B','C','D','E','F','GG','GG','I','J','K'],'Year':[2011,2012,2013,2014,2015,2011,2012,2012,2013,2014,2015],'EXPECTED':[2,3,4,6.1,7,7.1,8,3,10,11,11],'ACTUAL':[0.5,1.8,2.7,5.1,5.8,6.8,10,10,8,8.1,8.2]})

My goal is to add a column ratio = actual/expected. If project GG were not in the data, this would be trivial:

tdf['Ratio']=tdf['ACTUAL']/tdf['EXPECTED']

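Because project GG appears twice (same City and Year, with the same ACTUAL on both rows), I would like to add another column, ACTUAL_ADJUSTED, in which ACTUAL is split in proportion to EXPECTED, like this: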

prj_ratio = 10/(8+3) = 0.909
gg6_actual = (0.909*8)=7.272
gg7_actual = (0.909*3)=2.727
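
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the intended proration; the variable names are illustrative, and it assumes the repeated ACTUAL of 10 is the project total that should be divided between the two GG rows.

```python
# Worked arithmetic for project GG, using the values from the dataframe above.
actual = 10.0            # ACTUAL, repeated on both GG rows (assumed to be the project total)
expected = [8.0, 3.0]    # EXPECTED for the two GG rows

# One ratio for the whole project: total ACTUAL over total EXPECTED.
prj_ratio = actual / sum(expected)

# Each row receives its EXPECTED scaled by that ratio.
adjusted = [prj_ratio * e for e in expected]

print(round(prj_ratio, 3))
print([round(a, 3) for a in adjusted])
```

A useful sanity check is that the adjusted values add back up to the original ACTUAL of 10.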

What have I tried? I wrote a function:

def make_adjustments(r): 
    s = tdf[(tdf['City']==r['City']) & (tdf['Year']==r['Year']) ]
    if len(s)>1:
        return "problem here" 
    else:
        return 'ok'


tdf['ACTUAL_ADJUSTED'] = tdf.apply(make_adjustments,axis=1)

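This function identifies the problem rows, but on my real data it takes a very long time, so I have concluded that I am on the wrong track. Any ideas on how to solve this?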
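
One likely reason the function above is slow is that it filters the whole dataframe once per row, so the work grows roughly with the square of the row count. As a minimal sketch, the same 'ok' / 'problem here' flag can be produced in a single pass with `DataFrame.duplicated`; the names `dup` and `status` are illustrative only.

```python
import pandas as pd

tdf = pd.DataFrame({
    'City': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'PRJ': ['A', 'B', 'C', 'D', 'E', 'F', 'GG', 'GG', 'I', 'J', 'K'],
    'Year': [2011, 2012, 2013, 2014, 2015, 2011, 2012, 2012, 2013, 2014, 2015],
    'EXPECTED': [2, 3, 4, 6.1, 7, 7.1, 8, 3, 10, 11, 11],
    'ACTUAL': [0.5, 1.8, 2.7, 5.1, 5.8, 6.8, 10, 10, 8, 8.1, 8.2],
})

# keep=False marks every row whose (City, Year) pair occurs more than once,
# not just the second and later occurrences.
dup = tdf.duplicated(subset=['City', 'Year'], keep=False)

# Translate the boolean mask into the labels used by make_adjustments.
status = dup.map({True: 'problem here', False: 'ok'})
print(status)
```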

2 answers:

Answer 0: (score: 1)

Use transform with nunique, and then np.where:

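import numpy as np
# s: number of distinct EXPECTED values per (City, PRJ) group; s1: their sum
s=tdf.groupby(['City','PRJ'])['EXPECTED'].transform('nunique')
s1=tdf.groupby(['City','PRJ'])['EXPECTED'].transform('sum')
tdf['ACTUAL_ADJUSTED']=np.where(s>1,'problem here','ok')
# plain ratio for single rows, prorated ACTUAL for duplicated projects
tdf['value']=np.where(s==1,tdf.ACTUAL/tdf.EXPECTED,tdf.ACTUAL/s1*tdf.EXPECTED)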

tdf
Out[728]: 
   City PRJ  Year  EXPECTED  ACTUAL     Ratio ACTUAL_ADJUSTED     value
0    NY   A  2011       2.0     0.5  0.250000              ok  0.250000
1    NY   B  2012       3.0     1.8  0.600000              ok  0.600000
2    NY   C  2013       4.0     2.7  0.675000              ok  0.675000
3    NY   D  2014       6.1     5.1  0.836066              ok  0.836066
4    NY   E  2015       7.0     5.8  0.828571              ok  0.828571
5    CA   F  2011       7.1     6.8  0.957746              ok  0.957746
6    CA  GG  2012       8.0    10.0  1.250000    problem here  7.272727
7    CA  GG  2012       3.0    10.0  3.333333    problem here  2.727273
8    CA   I  2013      10.0     8.0  0.800000              ok  0.800000
9    CA   J  2014      11.0     8.1  0.736364              ok  0.736364
10   CA   K  2015      11.0     8.2  0.745455              ok  0.745455
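
One caveat worth keeping in mind: `nunique` counts distinct EXPECTED values, not rows, so a duplicated project whose rows happen to share the same EXPECTED would not be flagged. The sketch below uses a hypothetical three-row frame (`demo`, not part of the question's data) to compare it with `transform('size')`, which counts rows per group.

```python
import pandas as pd

# Hypothetical data: project GG appears twice with identical EXPECTED values.
demo = pd.DataFrame({
    'City': ['CA', 'CA', 'NY'],
    'PRJ': ['GG', 'GG', 'A'],
    'EXPECTED': [5.0, 5.0, 2.0],
})

grouped = demo.groupby(['City', 'PRJ'])['EXPECTED']

# Distinct EXPECTED values per group: the GG rows look like a single row.
n_distinct = grouped.transform('nunique')

# Rows per group: the GG rows are correctly seen as a group of two.
n_rows = grouped.transform('size')

print(pd.DataFrame({'n_distinct': n_distinct, 'n_rows': n_rows}))
```

If duplicated rows can carry equal EXPECTED values in the real data, testing on the row count is the safer condition.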

Answer 1: (score: 1)

Try:

def adjust(x):
    if len(x)==1:
        return x['ACTUAL']/x['EXPECTED']
    else:
        return x['ACTUAL'] * x['EXPECTED'] / x['EXPECTED'].sum()


tdf['RATIO'] = (tdf.groupby(['City', 'Year'])
                .apply(adjust)
                .reset_index(level=[0,1], drop=True)
               )

Output:

+-----+-------+------+-------+-----------+---------+----------+
|     | City  | PRJ  | Year  | EXPECTED  | ACTUAL  |  RATIO   |
+-----+-------+------+-------+-----------+---------+----------+
|  0  | NY    | A    | 2011  | 2.0       | 0.5     | 0.250000 |
|  1  | NY    | B    | 2012  | 3.0       | 1.8     | 0.600000 |
|  2  | NY    | C    | 2013  | 4.0       | 2.7     | 0.675000 |
|  3  | NY    | D    | 2014  | 6.1       | 5.1     | 0.836066 |
|  4  | NY    | E    | 2015  | 7.0       | 5.8     | 0.828571 |
|  5  | CA    | F    | 2011  | 7.1       | 6.8     | 0.957746 |
|  6  | CA    | GG   | 2012  | 8.0       | 10.0    | 7.272727 |
|  7  | CA    | GG   | 2012  | 3.0       | 10.0    | 2.727273 |
|  8  | CA    | I    | 2013  | 10.0      | 8.0     | 0.800000 |
|  9  | CA    | J    | 2014  | 11.0      | 8.1     | 0.736364 |
| 10  | CA    | K    | 2015  | 11.0      | 8.2     | 0.745455 |
+-----+-------+------+-------+-----------+---------+----------+

Or, if you want the "ACTUAL_ADJUSTED" column from your example:

tdf['ACTUAL_ADJUSTED'] = (tdf.groupby(['City', 'Year'])
                          .ACTUAL.transform(lambda x: 
                                            'OK' if len(x)==1 
                                                 else 'problem here')
                         )
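
If the goal is a numeric ACTUAL_ADJUSTED column plus a ratio derived from it, as the question's arithmetic suggests, a branch-free variant is possible. The following is a sketch under two assumptions that are not confirmed by the post: duplicates are defined by (City, Year), and the repeated ACTUAL is the group total to be split in proportion to EXPECTED. The helper names `group_expected` and `share` are illustrative.

```python
import pandas as pd

tdf = pd.DataFrame({
    'City': ['NY', 'NY', 'NY', 'NY', 'NY', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'PRJ': ['A', 'B', 'C', 'D', 'E', 'F', 'GG', 'GG', 'I', 'J', 'K'],
    'Year': [2011, 2012, 2013, 2014, 2015, 2011, 2012, 2012, 2013, 2014, 2015],
    'EXPECTED': [2, 3, 4, 6.1, 7, 7.1, 8, 3, 10, 11, 11],
    'ACTUAL': [0.5, 1.8, 2.7, 5.1, 5.8, 6.8, 10, 10, 8, 8.1, 8.2],
})

# Total EXPECTED over all rows that share the same (City, Year).
group_expected = tdf.groupby(['City', 'Year'])['EXPECTED'].transform('sum')

# Each row's share of its group's EXPECTED; this is exactly 1.0 for
# single-row groups, so those rows keep their ACTUAL unchanged.
share = tdf['EXPECTED'] / group_expected

# Assumption: ACTUAL on duplicated rows is the group total to be split.
tdf['ACTUAL_ADJUSTED'] = tdf['ACTUAL'] * share
tdf['Ratio'] = tdf['ACTUAL_ADJUSTED'] / tdf['EXPECTED']

print(tdf)
```

Because the share is 1.0 for unique rows, no `if`/`else` or `np.where` is needed, and both GG rows end up with the same ratio of 10/11.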