How to aggregate across sub-rows and sub-columns of a pandas pivot table

Time: 2020-10-20 02:52:52

Tags: python pandas pivot-table aggregate

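I am using pandas in Python to pivot some data, and I want to perform two kinds of aggregation across parts of the pivot table. I know I can use margins to total over all rows or all columns. What I want instead is to aggregate several rows (not all of them) within a single column, or several columns within a single row. What is the best way to aggregate over sub-rows and sub-columns in pandas?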

Example code setup:

import pandas as pds

#Dataset
rows = [
    [1, 'Factory_1', 'crusher', 'electricity_usage', 15],
    [2, 'Factory_1', 'mixer', 'electricity_usage', 11],
    [3, 'Factory_1', 'turner', 'electricity_usage', 12],
    [4, 'Factory_2', 'crusher', 'electricity_usage', 2],
    [5, 'Factory_2', 'mixer', 'electricity_usage', 7],
    [6, 'Factory_2', 'turner', 'electricity_usage', 13],
    [7, 'Factory_1', 'crusher', 'running_hours', 6],
    [8, 'Factory_1', 'mixer', 'running_hours', 5],
    [9, 'Factory_1', 'turner', 'running_hours', 5],
    [10, 'Factory_2', 'crusher', 'running_hours', 1],
    [11, 'Factory_2', 'mixer', 'running_hours', 3],
    [12, 'Factory_2', 'turner', 'running_hours', 6]
]

dataFrame = pds.DataFrame(rows, columns=["id","Location","Type","recorded_type","value"])

#Pivot Table 1: Form multi row aggregation across a single column
ptable_1 = pds.pivot_table(data=dataFrame,index=['Location', 'Type'], columns=["recorded_type"], values=['value'])
print(ptable_1)

#Pivot Table 2: Form multi column aggregation across a single row
ptable_2 = pds.pivot_table(data=dataFrame,index=['recorded_type'], columns=["Location", "Type"], values=['value'])
print(ptable_2)

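Below is my attempt to aggregate pivot table 1 over multiple rows within a single column. I am trying to sum the recorded values of all machines at each location. Can this be done better?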

#Form aggregation across multiple rows in a single column

df1 = ptable_1.groupby(level=[0]).sum()
df1['Type'] = ["all", "all"]
#Reset index so Location is removed from the current index
df1.reset_index(inplace=True)
#Set multi-index of location and type
df1.set_index(['Location', 'Type'], inplace=True)
#Concat both dataframes
aggregated_table_1 = pds.concat([ptable_1.reset_index(),df1.reset_index()], ignore_index=True)
#Sort values by location, so appended table values are in the correct position
aggregated_table_1.sort_values('Location', inplace=True)

print(aggregated_table_1)

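For example, I am trying to total the electricity usage of all machine types for a particular factory. The aggregate therefore appears as a row whose "Type" is "all". Expected output for ptable_1: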

+---------------+-----------+---------+-------------------+---------------+
|               | Location  |  Type   |       value       |     value     |
+---------------+-----------+---------+-------------------+---------------+
| recorded_type |           |         | electricity_usage | running_hours |
|               | Factory_1 | crusher | 15                | 6             |
|               | Factory_1 | mixer   | 11                | 5             |
|               | Factory_1 | turner  | 12                | 5             |
|               | Factory_1 | all     | 38                | 16            |
|               | Factory_2 | crusher | 2                 | 1             |
|               | Factory_2 | mixer   | 7                 | 3             |
|               | Factory_2 | turner  | 13                | 6             |
|               | Factory_2 | all     | 22                | 10            |
+---------------+-----------+---------+-------------------+---------------+
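
One possible, more compact way to get this table is sketched below. It is only a sketch under a few assumptions: `aggfunc='sum'` is passed explicitly (the setup code above relies on the default, which is the mean, and gives the same numbers here only because each cell holds a single record), every Location is assumed to have every Type, and the names `totals`, `type_order`, `locations` and `full_index` are made up for this illustration.

```python
import pandas as pds

rows = [
    [1, 'Factory_1', 'crusher', 'electricity_usage', 15],
    [2, 'Factory_1', 'mixer', 'electricity_usage', 11],
    [3, 'Factory_1', 'turner', 'electricity_usage', 12],
    [4, 'Factory_2', 'crusher', 'electricity_usage', 2],
    [5, 'Factory_2', 'mixer', 'electricity_usage', 7],
    [6, 'Factory_2', 'turner', 'electricity_usage', 13],
    [7, 'Factory_1', 'crusher', 'running_hours', 6],
    [8, 'Factory_1', 'mixer', 'running_hours', 5],
    [9, 'Factory_1', 'turner', 'running_hours', 5],
    [10, 'Factory_2', 'crusher', 'running_hours', 1],
    [11, 'Factory_2', 'mixer', 'running_hours', 3],
    [12, 'Factory_2', 'turner', 'running_hours', 6],
]
dataFrame = pds.DataFrame(rows, columns=["id", "Location", "Type", "recorded_type", "value"])

# Assumption: aggfunc='sum' instead of the default mean used in the setup code
ptable_1 = pds.pivot_table(data=dataFrame, index=['Location', 'Type'],
                           columns=['recorded_type'], values=['value'], aggfunc='sum')

# One subtotal row per Location, summed over all machine types
totals = ptable_1.groupby(level='Location').sum()
# Give the subtotal rows the same (Location, Type) index shape, with Type = 'all'
totals.index = pds.MultiIndex.from_product([totals.index, ['all']],
                                           names=['Location', 'Type'])

# Target row order: the existing types followed by 'all', repeated for each Location
type_order = list(ptable_1.index.get_level_values('Type').unique()) + ['all']
locations = ptable_1.index.get_level_values('Location').unique()
full_index = pds.MultiIndex.from_product([locations, type_order],
                                         names=['Location', 'Type'])

# Append the subtotal rows, then reorder so each 'all' row follows its factory
aggregated_table_1 = pds.concat([ptable_1, totals]).reindex(full_index)
print(aggregated_table_1)
```

Building the target index explicitly avoids the reset_index/set_index round trip and keeps the result indexed by (Location, Type). If some Location lacks a Type, the reindex would add an all-NaN row for it, so the last step would need adjusting in that case.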

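Second, I am not sure how to aggregate across sub-columns, as shown below, to get for each Location the sum over all Type columns of ptable_2. The aggregate is a new column whose Type is 'all'.

Expected output for ptable_2: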

+-------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|     Location      | Factory_1 | Factory_1 | Factory_1 | Factory_1 | Factory_2 | Factory_2 | Factory_2 | Factory_2 |
+-------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Type              | crusher   | mixer     | turner    | all       | crusher   | mixer     | turner    | all       |
| recorded_type     |           |           |           |           |           |           |           |           |
| electricity_usage | 15        | 11        | 12        | 38        | 2         | 7         | 13        | 22        |
| running_hours     | 6         | 5         | 5         | 16        | 1         | 3         | 6         | 10        |
+-------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
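
The same idea can be applied along the columns. The sketch below is one possible approach, not a definitive one. Its assumptions: `values='value'` is passed as a string rather than a list, so the columns have only the two levels (Location, Type) without the extra top-level `value` label; `aggfunc='sum'` replaces the default mean; every Location has every Type; and `totals`, `type_order`, `locations` and `full_columns` are illustrative names.

```python
import pandas as pds

rows = [
    [1, 'Factory_1', 'crusher', 'electricity_usage', 15],
    [2, 'Factory_1', 'mixer', 'electricity_usage', 11],
    [3, 'Factory_1', 'turner', 'electricity_usage', 12],
    [4, 'Factory_2', 'crusher', 'electricity_usage', 2],
    [5, 'Factory_2', 'mixer', 'electricity_usage', 7],
    [6, 'Factory_2', 'turner', 'electricity_usage', 13],
    [7, 'Factory_1', 'crusher', 'running_hours', 6],
    [8, 'Factory_1', 'mixer', 'running_hours', 5],
    [9, 'Factory_1', 'turner', 'running_hours', 5],
    [10, 'Factory_2', 'crusher', 'running_hours', 1],
    [11, 'Factory_2', 'mixer', 'running_hours', 3],
    [12, 'Factory_2', 'turner', 'running_hours', 6],
]
dataFrame = pds.DataFrame(rows, columns=["id", "Location", "Type", "recorded_type", "value"])

# Assumptions: values='value' (a string) gives two-level columns; aggfunc='sum'
ptable_2 = pds.pivot_table(data=dataFrame, index=['recorded_type'],
                           columns=['Location', 'Type'], values='value', aggfunc='sum')

# Sum the Type columns within each Location: transpose, group the rows, transpose back
totals = ptable_2.T.groupby(level='Location').sum().T
# Label the subtotal columns as (Location, 'all')
totals.columns = pds.MultiIndex.from_product([totals.columns, ['all']],
                                             names=['Location', 'Type'])

# Target column order: the existing types followed by 'all', repeated for each Location
type_order = list(ptable_2.columns.get_level_values('Type').unique()) + ['all']
locations = ptable_2.columns.get_level_values('Location').unique()
full_columns = pds.MultiIndex.from_product([locations, type_order],
                                           names=['Location', 'Type'])

# Append the subtotal columns, then reorder so each 'all' column follows its factory
aggregated_table_2 = pds.concat([ptable_2, totals], axis=1).reindex(columns=full_columns)
print(aggregated_table_2)
```

Transposing before grouping sidesteps grouping along the column axis directly, at the cost of two cheap transposes on a small table.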

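Edit 1: This is my output, copied straight from Python, from Serge de Gosson de Varennes's approach using melt() with default arguments. I lose the recorded_type record for each row, and it is replaced by a NaN column. Should I try to aggregate from this to form the expected output?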

Df_ex1 = dfex1.melt() # Expected output 1
      NaN      recorded_type  value
0   value  electricity_usage     15
1   value  electricity_usage     11
2   value  electricity_usage     12
3   value  electricity_usage      2
4   value  electricity_usage      7
5   value  electricity_usage     13
6   value      running_hours      6
7   value      running_hours      5
8   value      running_hours      5
9   value      running_hours      1
10  value      running_hours      3
11  value      running_hours      6


Df_exp2 = dfex2.melt() # Expected output 2
      NaN   Location     Type  value
0   value  Factory_1  crusher     15
1   value  Factory_1  crusher      6
2   value  Factory_1    mixer     11
3   value  Factory_1    mixer      5
4   value  Factory_1   turner     12
5   value  Factory_1   turner      5
6   value  Factory_2  crusher      2
7   value  Factory_2  crusher      1
8   value  Factory_2    mixer      7
9   value  Factory_2    mixer      3
10  value  Factory_2   turner     13
11  value  Factory_2   turner      6
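
The row labels disappear because melt() discards the index by default. A minimal sketch of one way to keep them follows, assuming pandas 1.1 or later, where melt() accepts `ignore_index=False`. It also assumes `values='value'` is passed as a string so the columns carry no extra top-level `value` label; as far as I know, recent pandas versions may refuse to melt a frame whose column labels already contain the default `value_name`. The name `long_2` is illustrative.

```python
import pandas as pds

rows = [
    [1, 'Factory_1', 'crusher', 'electricity_usage', 15],
    [2, 'Factory_1', 'mixer', 'electricity_usage', 11],
    [3, 'Factory_1', 'turner', 'electricity_usage', 12],
    [4, 'Factory_2', 'crusher', 'electricity_usage', 2],
    [5, 'Factory_2', 'mixer', 'electricity_usage', 7],
    [6, 'Factory_2', 'turner', 'electricity_usage', 13],
    [7, 'Factory_1', 'crusher', 'running_hours', 6],
    [8, 'Factory_1', 'mixer', 'running_hours', 5],
    [9, 'Factory_1', 'turner', 'running_hours', 5],
    [10, 'Factory_2', 'crusher', 'running_hours', 1],
    [11, 'Factory_2', 'mixer', 'running_hours', 3],
    [12, 'Factory_2', 'turner', 'running_hours', 6],
]
dataFrame = pds.DataFrame(rows, columns=["id", "Location", "Type", "recorded_type", "value"])

# Assumption: values='value' (a string) so the columns are just (Location, Type)
ptable_2 = pds.pivot_table(data=dataFrame, index=['recorded_type'],
                           columns=['Location', 'Type'], values='value')

long_2 = (
    ptable_2
    .melt(ignore_index=False)     # keep the recorded_type index instead of dropping it
    .rename_axis('recorded_type') # name the index explicitly before moving it
    .reset_index()                # turn the index into an ordinary column
)
print(long_2)
```

The result has one row per (recorded_type, Location, Type) combination, so it can be grouped or summed without losing track of which measurement each value belongs to.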

1 Answer:

Answer 0 (score: 0)

You almost have it: you need to melt the dataframe:

import pandas as pds
rows = [
    [1, 'Factory_1', 'crusher', 'electricity_usage', 15],
    [2, 'Factory_1', 'mixer', 'electricity_usage', 11],
    [3, 'Factory_1', 'turner', 'electricity_usage', 12],
    [4, 'Factory_2', 'crusher', 'electricity_usage', 2],
    [5, 'Factory_2', 'mixer', 'electricity_usage', 7],
    [6, 'Factory_2', 'turner', 'electricity_usage', 13],
    [7, 'Factory_1', 'crusher', 'running_hours', 6],
    [8, 'Factory_1', 'mixer', 'running_hours', 5],
    [9, 'Factory_1', 'turner', 'running_hours', 5],
    [10, 'Factory_2', 'crusher', 'running_hours', 1],
    [11, 'Factory_2', 'mixer', 'running_hours', 3],
    [12, 'Factory_2', 'turner', 'running_hours', 6]
]

dataFrame = pds.DataFrame(rows, columns=["id","Location","Type","recorded_type","value"])


ptable_1 = pds.pivot_table(data=dataFrame,index=['Location', 'Type'], columns=["recorded_type"], values=['value'])


ptable_2 = pds.pivot_table(data=dataFrame,index=['recorded_type'], columns=["Location", "Type"], values=['value'])
df = pds.DataFrame(ptable_1)

dfex1 = pds.DataFrame(ptable_1)
dfex2 = pds.DataFrame(ptable_2)

This gives you:

Df_ex1 = dfex1.melt() # Expected output 1
Df_exp2 = dfex2.melt() # Expected output 2
