I have a matrix of values and need to compute each value's share of its group total.
Example:
This is the code I am trying:
import pandas as pd
df = pd.DataFrame({'id':['id_1', 'id_2','id_3','id_4','id_5','id_6','id_7','id_8','id_9'],
'region':['reg_1','reg_1','reg_1','reg_2','reg_2','reg_2','reg_3','reg_3','reg_3'],
'class_1':[5,8,2,5,5,4,6,5,3],
'class_2':[6,8,3,7,8,5,8,6,4],
'class_3':[7,8,4,4,3,6,7,9,8,]})
cols=list(df.iloc[:,2:].columns)
weights=df.iloc[:,2:].div(df.groupby(['region'])[cols].sum())
It does not work.
I have computed the matrix of sums by region and class:
sum=df.set_index('id').groupby(['region']).sum()
But I do not know how to divide matrices of different sizes.
Could someone please help? Thanks.
Answer 0 (score: 3)
Create a MultiIndex, so that the level parameter can be used in DataFrame.div:
cols = df.columns[2:]
df1 = df.groupby(['region'])[cols].sum()
#another solution
#df1 = df.iloc[:,2:].groupby(df['region']).sum()
weights=df.set_index(['id','region']).div(df1, level='region').reset_index()
print (weights)
id region class_1 class_2 class_3
0 id_1 reg_1 0.333333 0.352941 0.368421
1 id_2 reg_1 0.533333 0.470588 0.421053
2 id_3 reg_1 0.133333 0.176471 0.210526
3 id_4 reg_2 0.357143 0.350000 0.307692
4 id_5 reg_2 0.357143 0.400000 0.230769
5 id_6 reg_2 0.285714 0.250000 0.461538
6 id_7 reg_3 0.428571 0.444444 0.291667
7 id_8 reg_3 0.357143 0.333333 0.375000
8 id_9 reg_3 0.214286 0.222222 0.333333
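For context on why the attempt in the question does not work: div aligns on row labels, and the integer labels of the original frame never match the region labels of the summed frame, so every cell comes out as NaN. The sketch below illustrates this on a smaller, made-up frame (the names totals, misaligned, aligned_totals and shares are only for illustration), together with one possible workaround that looks up each row's region totals before dividing.

```python
import pandas as pd

# Smaller stand-in for the question's data (illustrative values only)
df = pd.DataFrame({
    'region': ['reg_1', 'reg_1', 'reg_2'],
    'class_1': [5, 8, 5],
    'class_2': [6, 8, 7],
})
cols = ['class_1', 'class_2']
totals = df.groupby('region')[cols].sum()  # indexed by region: reg_1, reg_2

# div aligns on row labels: 0, 1, 2 never match 'reg_1', 'reg_2',
# so no pair of cells lines up and the result contains only NaN
misaligned = df[cols].div(totals)
print(misaligned.isna().all().all())

# Look up the totals of each row's region, then divide by the raw values
# so that label alignment no longer plays a role
aligned_totals = totals.loc[df['region']].to_numpy()
shares = df[cols] / aligned_totals
print(shares)
```

Setting a MultiIndex and passing level='region', as in the answer above, achieves the same matching by label instead of by position.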
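Or create the MultiIndex first, so that the sums can also be computed by the region level:
df1=df.set_index(['id','region'])
#originally df1.sum(level='region'); the level argument of sum was removed in pandas 2.0
weights = df1.div(df1.groupby(level='region').sum(), level='region').reset_index()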
print (weights)
id region class_1 class_2 class_3
0 id_1 reg_1 0.333333 0.352941 0.368421
1 id_2 reg_1 0.533333 0.470588 0.421053
2 id_3 reg_1 0.133333 0.176471 0.210526
3 id_4 reg_2 0.357143 0.350000 0.307692
4 id_5 reg_2 0.357143 0.400000 0.230769
5 id_6 reg_2 0.285714 0.250000 0.461538
6 id_7 reg_3 0.428571 0.444444 0.291667
7 id_8 reg_3 0.357143 0.333333 0.375000
8 id_9 reg_3 0.214286 0.222222 0.333333
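A quick sanity check for any of these approaches is that the shares of each class column add up to 1 within every region. A minimal sketch, reusing the question's data and the first approach (the names totals and check are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8', 'id_9'],
    'region': ['reg_1', 'reg_1', 'reg_1', 'reg_2', 'reg_2', 'reg_2',
               'reg_3', 'reg_3', 'reg_3'],
    'class_1': [5, 8, 2, 5, 5, 4, 6, 5, 3],
    'class_2': [6, 8, 3, 7, 8, 5, 8, 6, 4],
    'class_3': [7, 8, 4, 4, 3, 6, 7, 9, 8],
})
cols = list(df.columns[2:])

# Shares computed as in the first approach of this answer
totals = df.groupby('region')[cols].sum()
weights = df.set_index(['id', 'region']).div(totals, level='region').reset_index()

# Summing the shares per region should give 1 in every class column
check = weights.groupby('region')[cols].sum()
print(np.allclose(check, 1.0))
```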
Another idea is to select the columns by position and use GroupBy.transform, which returns a DataFrame of the same size as the original, so it is possible to divide and assign back:
cols = df.columns[2:]
df[cols] = df[cols].div(df.groupby('region')[cols].transform('sum'))
print (df)
id region class_1 class_2 class_3
0 id_1 reg_1 0.333333 0.352941 0.368421
1 id_2 reg_1 0.533333 0.470588 0.421053
2 id_3 reg_1 0.133333 0.176471 0.210526
3 id_4 reg_2 0.357143 0.350000 0.307692
4 id_5 reg_2 0.357143 0.400000 0.230769
5 id_6 reg_2 0.285714 0.250000 0.461538
6 id_7 reg_3 0.428571 0.444444 0.291667
7 id_8 reg_3 0.357143 0.333333 0.375000
8 id_9 reg_3 0.214286 0.222222 0.333333
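Note that the transform approach overwrites the class columns of df. If the original values are still needed, one option is to write the shares into a copy instead. A minimal sketch, assuming the question's data (region_totals and weights are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8', 'id_9'],
    'region': ['reg_1', 'reg_1', 'reg_1', 'reg_2', 'reg_2', 'reg_2',
               'reg_3', 'reg_3', 'reg_3'],
    'class_1': [5, 8, 2, 5, 5, 4, 6, 5, 3],
    'class_2': [6, 8, 3, 7, 8, 5, 8, 6, 4],
    'class_3': [7, 8, 4, 4, 3, 6, 7, 9, 8],
})
cols = list(df.columns[2:])

# transform('sum') repeats each region's total on every row of that region,
# so the result has the same shape and index as df[cols]
region_totals = df.groupby('region')[cols].transform('sum')

# Write the shares into a copy so that df keeps the raw values
weights = df.copy()
weights[cols] = df[cols].div(region_totals)
print(weights)
```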
Edit: performance comparison:
import numpy as np
np.random.seed(123)
N = 1000
L = list('abcdefghijklmno')
df1 = pd.DataFrame({'id': np.arange(N*len(L)),
'region': np.repeat(L, N)})
df = df1.join(pd.DataFrame(np.random.randint(100, size=(N*len(L), 5))).add_prefix('class_'))
print (df)
for @Brendam Cox:
In [349]: %%timeit
...: cols = df.columns[2:]
...: df1 = df.groupby(['region'])[cols].sum()
...: #another solution
...: #df1 = df.iloc[:,2:].groupby(df['region']).sum()
...: weights=df.set_index(['id','region']).div(df1, level='region').reset_index()
...:
...:
13.9 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [350]: %%timeit
...: df1=df.set_index(['id','region'])
...: weights = df1.div(df1.sum(level='region'), level='region').reset_index()
...:
13.8 ms ± 595 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [351]: %%timeit
...: cols = df.columns[2:]
...: df[cols] = df[cols].div(df.groupby('region')[cols].transform('sum'))
...:
8.99 ms ± 602 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [352]: %%timeit
...: (df.set_index(['id','region'])
...: .groupby('region')
...: .apply(lambda x: x/x.sum()
...: )
...: )
...:
49.5 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
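The %%timeit cells above only run inside IPython or Jupyter. A rough sketch of how a similar comparison could be run as a plain script with the standard timeit module follows; the helper names with_transform and with_level_div are made up for this example, the test data is generated in the same spirit as above but not identically, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Test data similar in shape to the benchmark above: 15 regions x 1000 rows
rng = np.random.default_rng(123)
N = 1000
L = list('abcdefghijklmno')
df = pd.DataFrame({'id': np.arange(N * len(L)), 'region': np.repeat(L, N)})
for i in range(5):
    df[f'class_{i}'] = rng.integers(100, size=N * len(L))
cols = list(df.columns[2:])


def with_transform():
    # Divide by per-row region totals produced by transform
    return df[cols].div(df.groupby('region')[cols].transform('sum'))


def with_level_div():
    # Divide a MultiIndex frame by region totals, matching on the region level
    totals = df.groupby('region')[cols].sum()
    return df.set_index(['id', 'region']).div(totals, level='region').reset_index()


timings = {}
for fn in (with_transform, with_level_div):
    # Best of 3 repeats, 10 calls each, reported per call
    timings[fn.__name__] = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {timings[fn.__name__] * 1e3:.2f} ms per call')
```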
Answer 1 (score: 0)
apply also works:
(df.set_index(['id','region'])
.groupby('region')
.apply(lambda x: x/x.sum())
)
Output:
class_1 class_2 class_3
id region
id_1 reg_1 0.333333 0.352941 0.368421
id_2 reg_1 0.533333 0.470588 0.421053
id_3 reg_1 0.133333 0.176471 0.210526
id_4 reg_2 0.357143 0.350000 0.307692
id_5 reg_2 0.357143 0.400000 0.230769
id_6 reg_2 0.285714 0.250000 0.461538
id_7 reg_3 0.428571 0.444444 0.291667
id_8 reg_3 0.357143 0.333333 0.375000
id_9 reg_3 0.214286 0.222222 0.333333
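The result above keeps id and region in the index. As far as I know, the index returned by groupby(...).apply can also differ between pandas versions (newer releases may add the group key as an extra level). A sketch of a closely related variant that swaps apply for transform, which keeps the original index, and then calls reset_index to get the flat layout of the other answer (indexed, shares and flat are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8', 'id_9'],
    'region': ['reg_1', 'reg_1', 'reg_1', 'reg_2', 'reg_2', 'reg_2',
               'reg_3', 'reg_3', 'reg_3'],
    'class_1': [5, 8, 2, 5, 5, 4, 6, 5, 3],
    'class_2': [6, 8, 3, 7, 8, 5, 8, 6, 4],
    'class_3': [7, 8, 4, 4, 3, 6, 7, 9, 8],
})

indexed = df.set_index(['id', 'region'])

# transform returns a frame with the same index as its input,
# so no extra group-key level is added
shares = indexed.groupby(level='region').transform(lambda x: x / x.sum())

# Move id and region back into ordinary columns
flat = shares.reset_index()
print(flat)
```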