Matrix of values as percentages of group sums, grouped by a column

Time: 2019-06-26 13:30:58

Tags: python pandas

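I have a matrix of values and need to express each value as its share of the sum of its group.

Example:

[image]

What I need is a matrix giving, for each class, the percentage that each id contributes to the class/region total:

[image]

This is the code I am trying: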

import pandas as pd
df = pd.DataFrame({'id':['id_1', 'id_2','id_3','id_4','id_5','id_6','id_7','id_8','id_9'],
               'region':['reg_1','reg_1','reg_1','reg_2','reg_2','reg_2','reg_3','reg_3','reg_3'],
               'class_1':[5,8,2,5,5,4,6,5,3],
               'class_2':[6,8,3,7,8,5,8,6,4],
               'class_3':[7,8,4,4,3,6,7,9,8,]})
cols=list(df.iloc[:,2:].columns)
weights=df.iloc[:,2:].div(df.groupby(['region'])[cols].sum())

It does not work.

I have computed the matrix of region/class sums:

sum=df.set_index('id').groupby(['region']).sum()

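But I don't know how to divide matrices of different sizes.

Could someone please help? Thanks.

2 answers:

Answer 0 (score: 3)

Create a MultiIndex, so that the level parameter of DataFrame.div can be used: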

cols = df.columns[2:]
df1 = df.groupby(['region'])[cols].sum()
#another solution
#df1 = df.iloc[:,2:].groupby(df['region']).sum()
weights=df.set_index(['id','region']).div(df1, level='region').reset_index()
print (weights)
     id region   class_1   class_2   class_3
0  id_1  reg_1  0.333333  0.352941  0.368421
1  id_2  reg_1  0.533333  0.470588  0.421053
2  id_3  reg_1  0.133333  0.176471  0.210526
3  id_4  reg_2  0.357143  0.350000  0.307692
4  id_5  reg_2  0.357143  0.400000  0.230769
5  id_6  reg_2  0.285714  0.250000  0.461538
6  id_7  reg_3  0.428571  0.444444  0.291667
7  id_8  reg_3  0.357143  0.333333  0.375000
8  id_9  reg_3  0.214286  0.222222  0.333333
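
As a rough sketch of why the attempt in the question fails while the level-based division works: DataFrame.div aligns on index labels, and the default row labels 0, 1, 2, ... never match the region labels of the summed frame. The small frame below is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Made-up illustrative data (not the frame from the question).
df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'region': ['reg_1', 'reg_1', 'reg_2', 'reg_2'],
    'class_1': [5, 15, 2, 6],
    'class_2': [1, 3, 4, 4],
})
cols = ['class_1', 'class_2']
region_sums = df.groupby('region')[cols].sum()

# Row labels 0..3 never match 'reg_1'/'reg_2', so every cell ends up NaN.
naive = df[cols].div(region_sums)

# With 'region' as an index level, div can match each row to its region total.
weights = df.set_index(['id', 'region']).div(region_sums, level='region')
print(weights)
```

Within each region the resulting shares should add up to 1 per class, which is a quick sanity check on the result.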

Or create the MultiIndex first, so that sum can also be used with the level parameter:

df1=df.set_index(['id','region'])
weights = df1.div(df1.sum(level='region'), level='region').reset_index()
print (weights)
     id region   class_1   class_2   class_3
0  id_1  reg_1  0.333333  0.352941  0.368421
1  id_2  reg_1  0.533333  0.470588  0.421053
2  id_3  reg_1  0.133333  0.176471  0.210526
3  id_4  reg_2  0.357143  0.350000  0.307692
4  id_5  reg_2  0.357143  0.400000  0.230769
5  id_6  reg_2  0.285714  0.250000  0.461538
6  id_7  reg_3  0.428571  0.444444  0.291667
7  id_8  reg_3  0.357143  0.333333  0.375000
8  id_9  reg_3  0.214286  0.222222  0.333333
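
Note that the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in pandas 2.0. A minimal sketch of the same idea using groupby(level=...) instead, on a small made-up frame rather than the data from the question:

```python
import pandas as pd

# Made-up illustrative data (not the frame from the question).
df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'region': ['reg_1', 'reg_1', 'reg_2', 'reg_2'],
    'class_1': [5, 15, 2, 6],
    'class_2': [1, 3, 4, 4],
})

df1 = df.set_index(['id', 'region'])
# groupby(level=...).sum() stands in for the removed sum(level=...) call.
region_sums = df1.groupby(level='region').sum()
weights = df1.div(region_sums, level='region').reset_index()
print(weights)
```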

Another idea is to select the columns by position and use GroupBy.transform, which returns a DataFrame of the same size as the original, so the division can be done and the result assigned back:

cols = df.columns[2:]
df[cols] = df[cols].div(df.groupby('region')[cols].transform('sum'))
print (df)
     id region   class_1   class_2   class_3
0  id_1  reg_1  0.333333  0.352941  0.368421
1  id_2  reg_1  0.533333  0.470588  0.421053
2  id_3  reg_1  0.133333  0.176471  0.210526
3  id_4  reg_2  0.357143  0.350000  0.307692
4  id_5  reg_2  0.357143  0.400000  0.230769
5  id_6  reg_2  0.285714  0.250000  0.461538
6  id_7  reg_3  0.428571  0.444444  0.291667
7  id_8  reg_3  0.357143  0.333333  0.375000
8  id_9  reg_3  0.214286  0.222222  0.333333
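
The transform-based idea overwrites the value columns of df in place. If the original values are still needed, one possible variation is to write the shares into a copy, here also scaled to percentages. The frame below is made up for illustration, and the name pct is only a placeholder.

```python
import pandas as pd

# Made-up illustrative data (not the frame from the question).
df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'region': ['reg_1', 'reg_1', 'reg_2', 'reg_2'],
    'class_1': [5, 15, 2, 6],
    'class_2': [1, 3, 4, 4],
})

cols = df.columns[2:]
# transform('sum') repeats each region's total on every row of that region,
# so the result has the same shape and index as df[cols].
totals = df.groupby('region')[cols].transform('sum')

pct = df.copy()                                   # leave df untouched
pct[cols] = (df[cols] / totals * 100).round(2)    # shares as percentages
print(pct)
```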

Performance

EDIT, for @Brendam Cox:

import numpy as np

np.random.seed(123)
N = 1000
L = list('abcdefghijklmno')
df1 = pd.DataFrame({'id': np.arange(N*len(L)),
                    'region': np.repeat(L, N)})
df = df1.join(pd.DataFrame(np.random.randint(100, size=(N*len(L), 5))).add_prefix('class_'))
print (df)

In [349]: %%timeit
     ...: cols = df.columns[2:]
     ...: df1 = df.groupby(['region'])[cols].sum()
     ...: #another solution
     ...: #df1 = df.iloc[:,2:].groupby(df['region']).sum()
     ...: weights=df.set_index(['id','region']).div(df1, level='region').reset_index()
     ...: 
     ...: 
13.9 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [350]: %%timeit
     ...: df1=df.set_index(['id','region'])
     ...: weights = df1.div(df1.sum(level='region'), level='region').reset_index()
     ...: 
13.8 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [351]: %%timeit
     ...: cols = df.columns[2:]
     ...: df[cols] = df[cols].div(df.groupby('region')[cols].transform('sum'))
     ...: 
8.99 ms ± 602 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [352]: %%timeit
     ...: (df.set_index(['id','region'])
     ...:    .groupby('region')
     ...:    .apply(lambda x: x/x.sum()
     ...:    )
     ...: )
     ...: 
49.5 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 1 (score: 0)

apply also works:

(df.set_index(['id','region'])
   .groupby('region')
   .apply(lambda x: x/x.sum())
)

Output:

              class_1   class_2   class_3
id   region                              
id_1 reg_1   0.333333  0.352941  0.368421
id_2 reg_1   0.533333  0.470588  0.421053
id_3 reg_1   0.133333  0.176471  0.210526
id_4 reg_2   0.357143  0.350000  0.307692
id_5 reg_2   0.357143  0.400000  0.230769
id_6 reg_2   0.285714  0.250000  0.461538
id_7 reg_3   0.428571  0.444444  0.291667
id_8 reg_3   0.357143  0.333333  0.375000
id_9 reg_3   0.214286  0.222222  0.333333
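
Depending on the pandas version, groupby(...).apply may prepend the group key as an extra index level even when the function returns a like-indexed result (as far as I know, this is the default from pandas 2.0 on). Passing group_keys=False is one way to keep the (id, region) index shown above. A minimal sketch on a small made-up frame:

```python
import pandas as pd

# Made-up illustrative data (not the frame from the question).
df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'region': ['reg_1', 'reg_1', 'reg_2', 'reg_2'],
    'class_1': [5, 15, 2, 6],
    'class_2': [1, 3, 4, 4],
})

weights = (df.set_index(['id', 'region'])
             .groupby('region', group_keys=False)   # do not prepend the group key
             .apply(lambda x: x / x.sum()))         # share of each column total per region
print(weights)
```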