I am trying to get a rolling sum of a DataFrame after grouping on multiple levels:
import pandas as pd
import numpy as np
year_vec = np.arange(2000, 2005)
month_vec = np.arange(1, 4)
soln_list = []
firmList = [61, 62, 63]
firmId = []
year_month = []
year = []
month = []
for firmIndex in range(0, len(firmList)):
    for yearIndex in range(0, len(year_vec)):
        for monthIndex in range(0, len(month_vec)):
            soln_list.append("soln_%s_%s_%s" % (firmList[firmIndex], year_vec[yearIndex], month_vec[monthIndex]))
            firmId.append(firmList[firmIndex])
            month.append(month_vec[monthIndex])
            year.append(year_vec[yearIndex])
            year_month.append("%s_%s" % (year_vec[yearIndex], month_vec[monthIndex]))
df = pd.DataFrame({'firmId': firmId, 'year': year, 'month': month, 'year_month' : year_month,
                   'soln_vars': soln_list})
df = df.set_index(["firmId", "year_month"])
The resulting DataFrame looks like this:
                   month       soln_vars  year
firmId year_month
61     2000_1          1  soln_61_2000_1  2000
       2000_2          2  soln_61_2000_2  2000
       2000_3          3  soln_61_2000_3  2000
       2001_1          1  soln_61_2001_1  2001
       2001_2          2  soln_61_2001_2  2001
       2001_3          3  soln_61_2001_3  2001
       2002_1          1  soln_61_2002_1  2002
...                  ...             ...   ...
At this point, I want the sum of soln_vars over every 2 years, for each month and each firm. To do this, I first group by firmId and year, then sum:
df = df.groupby([df.index.get_level_values(0), "year"])["soln_vars"].sum()
This operation gives me the sum of soln_vars for each firm and each year:
firmId  year
61      2000    soln_61_2000_1soln_61_2000_2soln_61_2000_3
        2001    soln_61_2001_1soln_61_2001_2soln_61_2001_3
        2002    soln_61_2002_1soln_61_2002_2soln_61_2002_3
        2003    soln_61_2003_1soln_61_2003_2soln_61_2003_3
        2004    soln_61_2004_1soln_61_2004_2soln_61_2004_3
62      2000    soln_62_2000_1soln_62_2000_2soln_62_2000_3
        2001    soln_62_2001_1soln_62_2001_2soln_62_2001_3
...             ...
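In my application, the solution variables are provided by another library, so the sum yields a mathematical expression: soln_61_2000_1 + soln_61_2000_2 + soln_61_2000_3. For simplicity, I use strings here.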
However, grouping by firmId and then applying a rolling sum:
df = df.groupby(level=0, group_keys=False).rolling(2).sum()
does not change df. Any clarification on this point would be appreciated.
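For reference, here is a minimal sketch of one possible workaround. It rests on an assumption I have not confirmed: that rolling only aggregates numeric dtypes and so never applies + to object values such as strings or expression objects. The sketch builds the 2-year sums by hand within each firm. The helper name rolling_pair_sum and the shortened labels (s_61_2000 and so on) are hypothetical and only serve the illustration.

```python
import pandas as pd


def rolling_pair_sum(values, window=2):
    """Combine each run of `window` consecutive items with plain `+`.

    Positions where the window is not yet full are filled with None.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
            continue
        total = values[i - window + 1]
        for v in values[i - window + 2 : i + 1]:
            total = total + v  # works for str and for objects defining __add__
        out.append(total)
    return out


# Small stand-in for the per-firm, per-year result of the first groupby.
idx = pd.MultiIndex.from_product(
    [[61, 62], [2000, 2001, 2002]], names=["firmId", "year"]
)
yearly = pd.Series(["s_%s_%s" % (f, y) for f, y in idx], index=idx, dtype=object)

# Apply the helper firm by firm so a window never spans two firms.
pieces = []
for firm, grp in yearly.groupby(level="firmId"):
    pieces.append(
        pd.Series(rolling_pair_sum(list(grp)), index=grp.index, dtype=object)
    )
result = pd.concat(pieces)
print(result)
```

Because the helper only relies on +, the same loop should in principle also work when the values are expression objects from another library, provided those objects support addition.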