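I am automating Excel reports with Python. In the Excel report I use a pivot table whose report layout is set to "Compact Form", where one or more columns are shown as row labels. For example, given the input data below: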
Country City Employee Salary $
Mexico Chiapas A 100,000
Mexico Chihuahua B 245,132
Mexico Chihuahua C 200,000
Mexico Chihuahua D 175,000
United States Alabama E 106,088
United States Alaska F 56,121
United States Arizona G 9,737
United States Arizona H 250,000
In the Excel report it is displayed as:
Row Labels         Sum of Salary $
Mexico             720,132
  Chiapas          100,000
    A              100,000
  Chihuahua        620,132
    B              245,132
    C              200,000
    D              175,000
United States      421,946
  Alabama          106,088
    E              106,088
  Alaska           56,121
    F              56,121
  Arizona          259,737
    G              9,737
    H              250,000
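In the compact view, Country and City are not shown in separate columns. They appear as row labels instead, which is a built-in Excel feature. I am trying to replicate the same view in Python. I have loaded the raw file into a pandas DataFrame.
I tried df.pivot and df.pivot_table, but I only get the usual view shown below, not the one above: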
Country City Employee Sum of Salary $
Mexico Chiapas A 100,000
Chihuahua B 245,132
C 200,000
D 175,000
United States Alabama E 106,088
Alaska F 56,121
Arizona G 9,737
H 250,000
Answer 0 (score: 0)
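Applying several groupby calls followed by concat would be easier, but you need the rows in nested order, so my answer is written specifically for your problem: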
df
Country City Employee Salary
0 Mexico Chiapas A 100000
1 Mexico Chihuahua B 245132
2 Mexico Chihuahua C 200000
3 Mexico Chihuahua D 175000
4 United States Alabama E 106088
5 United States Alaska F 56121
6 United States Arizona G 9737
7 United States Arizona H 250000
Code:
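import pandas as pd

res = pd.DataFrame()
# Select only "Salary" so the text columns are left out of the sum.
country = df.groupby("Country")[["Salary"]].sum()
for i in range(len(country)):
    # Reshape the country total into one row: label, sum.
    c = pd.DataFrame(country.iloc[i])
    c = c.reset_index(drop=True)
    c.index = c.columns
    c = c.reset_index()
    c.columns = ["Row Labels", "Salary"]
    # Restrict to the current country so equal city names elsewhere are not mixed in.
    in_country = df[df["Country"] == country.iloc[i].name]
    city = in_country.groupby("City")[["Salary"]].sum()
    for j in range(len(city)):
        # Same reshaping for the city subtotal.
        c2 = pd.DataFrame(city.iloc[j])
        c2 = c2.reset_index(drop=True)
        c2.index = c2.columns
        c2 = c2.reset_index()
        c2.columns = ["Row Labels", "Salary"]
        # Employee rows for this city.
        in_city = in_country[in_country["City"] == city.iloc[j].name]
        employee = in_city.groupby("Employee")[["Salary"]].sum()
        c3 = employee.reset_index()
        c3.columns = ["Row Labels", "Salary"]
        res = pd.concat([res, c, c2, c3])
        res = res.reset_index(drop=True)
# The country row is appended once per city, so drop the repeats.
res = res.drop_duplicates().reset_index(drop=True)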
Result:
res
Row Labels Salary
0 Mexico 720132
1 Chiapas 100000
2 A 100000
3 Chihuahua 620132
4 B 245132
5 C 200000
6 D 175000
7 United States 421946
8 Alabama 106088
9 E 106088
10 Alaska 56121
11 F 56121
12 Arizona 259737
13 G 9737
14 H 250000
If you don't mind the order of the labels, the following solution will be faster (useful if your dataset is large):
c1 = df.groupby(["Country"])["Salary"].sum().reset_index()
c1.columns = ["Row Labels", "Salary"]
c2 = df.groupby(["Country","City"])["Salary"].sum().reset_index()[["City","Salary"]]
c2.columns = ["Row Labels", "Salary"]
c3 = df.groupby(["Country","City","Employee"])["Salary"].sum().reset_index()[["Employee","Salary"]]
c3.columns = ["Row Labels", "Salary"]
res = pd.concat([c1,c2,c3])
res
Row Labels Salary
0 Mexico 720132
1 United States 421946
0 Chiapas 100000
1 Chihuahua 620132
2 Alabama 106088
3 Alaska 56121
4 Arizona 259737
0 A 100000
1 B 245132
2 C 200000
3 D 175000
4 E 106088
5 F 56121
6 G 9737
7 H 250000
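If the nested order matters but the loop is too slow, the two ideas can probably be combined: group once per level, then sort all rows by the key columns. This is only a minimal sketch. It assumes no key column contains empty strings or missing values, and the names `keys`, `levels` and the two-space indent are my own illustrative choices, not anything Excel or pandas prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Mexico"] * 4 + ["United States"] * 4,
    "City": ["Chiapas", "Chihuahua", "Chihuahua", "Chihuahua",
             "Alabama", "Alaska", "Arizona", "Arizona"],
    "Employee": list("ABCDEFGH"),
    "Salary": [100000, 245132, 200000, 175000, 106088, 56121, 9737, 250000],
})

keys = ["Country", "City", "Employee"]  # outermost to innermost row label
levels = []
for depth in range(1, len(keys) + 1):
    # Subtotal at this nesting depth (Country, then Country+City, ...).
    g = df.groupby(keys[:depth], as_index=False)["Salary"].sum()
    # Blank out the deeper keys; "" sorts before any real label,
    # so each parent row ends up directly above its children.
    for k in keys[depth:]:
        g[k] = ""
    # Indent the label by depth to mimic the compact layout.
    g["Row Labels"] = "  " * (depth - 1) + g[keys[depth - 1]]
    levels.append(g)

res = (
    pd.concat(levels)
    .sort_values(keys)
    .reset_index(drop=True)[["Row Labels", "Salary"]]
)
print(res)
```

Note that the rows come out in alphabetical order within each level, so this matches the Excel view only when the pivot table uses its default ascending sort.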
Hope this works!