import pandas as pd
import numpy as np
import random
import string
N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']
df = pd.DataFrame(
np.random.uniform(1,10,size=(N, 3)),
columns=list('XYZ')
)
df['ht'] = pd.Series(random.choice(K) for _ in range(N))
df['at'] = pd.Series(random.choice(K) for _ in range(N))
df['J'] = pd.Series(random.choice(J) for _ in range(N))
df['R'] = pd.Series(random.choice(L) for _ in range(N))
df1 = (df.X).groupby([df.ht, df.J]).agg(['sum', 'size']).unstack(fill_value=0)
print(df.head())
我想创建一个新的数据框,其中列' X'将聚集在10个平均箱中。然后需要每个群集每年计算一笔总和:' R' R' *' X',其中' R'是' h。
EDIT;
期望的结果的例子:
箱/ 2012 /二千零十四分之二千零十三/ Total_sum_years / TOTAL_NUMBER _' H'
0< 1.5 / 15/8/5/28/7
答案 0 :(得分:3)
新猜测
df_agg = df.groupby([pd.cut(df.X, 10), 'R', 'J'])['X'].agg(['sum', 'size'])\
.unstack('J', fill_value=0)\
.reset_index('R')
df_agg = df_agg.loc[df_agg['R'] == 'h']
df_agg['total_sum_years'] = df_agg['sum'].sum(1)
df_agg['total_number_h'] = df_agg['size'].sum(1)
输出
R sum size \
J 2012 2013 2014 2012 2013 2014
X
(1.203, 2.0842] h 1.421185 2.660724 3.380401 1 2 2
(2.0842, 2.956] h 4.984133 5.044891 0.000000 2 2 0
(2.956, 3.828] h 0.000000 3.190256 6.644137 0 1 2
(3.828, 4.7] h 4.086577 0.000000 0.000000 1 0 0
(4.7, 5.572] h 0.000000 9.595351 0.000000 0 2 0
(5.572, 6.444] h 11.659066 6.037559 12.452256 2 1 2
(6.444, 7.316] h 6.535510 0.000000 6.820929 1 0 1
(7.316, 8.188] h 0.000000 0.000000 23.259200 0 0 3
(8.188, 9.0605] h 8.944386 8.352764 25.645607 1 1 3
(9.0605, 9.933] h 18.863608 29.606962 9.222994 2 3 1
total_sum_years total_number_h
J
X
(1.203, 2.0842] 7.462311 5
(2.0842, 2.956] 10.029024 4
(2.956, 3.828] 9.834393 3
(3.828, 4.7] 4.086577 1
(4.7, 5.572] 9.595351 2
(5.572, 6.444] 30.148881 5
(6.444, 7.316] 13.356440 2
(7.316, 8.188] 23.259200 3
(8.188, 9.0605] 42.942758 5
(9.0605, 9.933] 57.693565 6
我在这里拍摄。仍然不确定你想要什么。我首先将您的数据框创建重新编写为更好。
N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']
df = pd.DataFrame(
{'X': np.random.uniform(1,10,N),
'Y': np.random.uniform(1,10,N),
'Z': np.random.uniform(1,10,N),
'ht':np.random.choice(K, N),
'at':np.random.choice(K, N),
'J':np.random.choice(J, N),
'R':np.random.choice(L, N)
})
然后我将X与pd.cut分组,并按J和ht
分组df.groupby([pd.cut(df.X, 10), 'ht', 'J'])['X'].sum().unstack('J', fill_value=0)
产生以下内容。我认为这可以让你接近你想要的东西。
J 2012 2013 2014
X ht
(1.203, 2.0842] A 2.076946 1.360880 1.544429
B 0.000000 0.000000 1.434798
C 0.000000 1.313596 1.835972
D 0.000000 1.212149 1.280920
F 0.000000 0.000000 3.545768
H 1.421185 1.299844 0.000000
(2.0842, 2.956] A 2.331453 0.000000 0.000000
B 0.000000 2.489030 0.000000
C 5.689538 2.555860 0.000000
D 5.338711 0.000000 0.000000
H 0.000000 0.000000 6.428545
(2.956, 3.828] A 0.000000 6.692342 0.000000
B 0.000000 0.000000 3.211878
C 0.000000 3.673353 3.062432
D 0.000000 0.000000 3.432259
E 3.789112 0.000000 3.612064
G 0.000000 3.190256 3.117251
(3.828, 4.7] E 8.758016 0.000000 4.302206
G 4.086577 0.000000 0.000000
(4.7, 5.572] A 5.268921 0.000000 0.000000
B 0.000000 4.845556 0.000000
C 0.000000 4.990270 5.078201
E 0.000000 4.749795 0.000000
F 0.000000 0.000000 5.260480
G 4.811551 0.000000 0.000000
H 0.000000 0.000000 4.817087
.....