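Aggregating multiple string values in a Pandas pivot table

Time: 2018-03-05 01:55:22

Tags: pandas pivot pivot-table

I'm trying to create a calendar that summarizes the information in a project catalog and organizes it chronologically and by project type. I've been working with Pandas and can't quite get the basic structure right. For example, given this dataset: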

           Type      Name   Health Month  Year
0     Marketing  ProjectA       OK   Jan  2018
1       Science  ProjectB  Warning   Apr  2018
2     Marketing  ProjectC       OK   Mar  2018
3   Development  ProjectD       OK   Feb  2018
4     Marketing  ProjectE       OK   Jan  2018
5   Development  ProjectF  Warning   Feb  2018
6   Development  ProjectG  Trouble   May  2018
7     Marketing  ProjectH  Trouble   May  2018
8   Development  ProjectI  Warning   Feb  2018
9     Marketing  ProjectJ       OK   May  2018
10      Science  ProjectK  Warning   Apr  2018

Using the technique shown in "Remove none values from dataframe", I can create fields that track the rank order of each item in the final table:

df['aggval'] = df['Year'].map(str) + df['Month'] + df['Type']
df['index'] = df.groupby(['aggval']).cumcount()

This yields two extra columns:

           Type      Name   Health Month  Year              aggval  index
0     Marketing  ProjectA       OK   Jan  2018    2018JanMarketing      0
1       Science  ProjectB  Warning   Apr  2018      2018AprScience      0
2     Marketing  ProjectC       OK   Mar  2018    2018MarMarketing      0
3   Development  ProjectD       OK   Feb  2018  2018FebDevelopment      0
4     Marketing  ProjectE       OK   Jan  2018    2018JanMarketing      1
5   Development  ProjectF  Warning   Feb  2018  2018FebDevelopment      1
6   Development  ProjectG  Trouble   May  2018  2018MayDevelopment      0
7     Marketing  ProjectH  Trouble   May  2018    2018MayMarketing      0
8   Development  ProjectI  Warning   Feb  2018  2018FebDevelopment      2
9     Marketing  ProjectJ       OK   May  2018    2018MayMarketing      1
10      Science  ProjectK  Warning   Apr  2018      2018AprScience      1

With these derived columns, we can now pivot to create an initial version of the project summary table:

pv1 = pd.pivot_table(df, values='Name', index=['Type', 'index'], columns=['Year', 'Month'], aggfunc=lambda x: "".join(x)).fillna('')
pv1 = pv1.reindex(columns = zip(12 * [2018], ['Jan', 'Feb', 'Mar', 'Apr', 'May']))

This produces the following report. It is basically correct: it collects and lists the projects, shows their names, and organizes them by type (swim lane) and chronologically by year and month:

Year                 2018                                          
Month                Jan       Feb       Mar       Apr       May   
Type        index                                                  
Development 0                ProjectD                      ProjectG
            1                ProjectF                              
            2                ProjectI                              
Marketing   0      ProjectA            ProjectC            ProjectH
            1      ProjectE                                ProjectJ
Science     0                                    ProjectB          
            1                                    ProjectK          

I am now struggling to extend this model so that it shows each project's name and health together.

I can add the Health field as a second pivot table value:

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
# pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

This produces:

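                   Health                               Name                                          
Year                2018                                2018                                          
Month               Apr      Feb    Jan Mar   May       Apr       Feb       Jan       Mar       May   
Type        index                                                                                     
Development 0                    OK          Trouble            ProjectD                      ProjectG
            1               Warning                             ProjectF                              
            2               Warning                             ProjectI                              
Marketing   0                        OK  OK  Trouble                      ProjectA  ProjectC  ProjectH
            1                        OK           OK                      ProjectE            ProjectJ
Science     0      Warning                            ProjectB                                        
            1      Warning                            ProjectK                  

This is the right idea: the health and the name of every project both appear, in the correct month and the correct Type swim lane, but I would like them side by side for each project. Reindexing the columns produces the correct result at the header level, but wipes out the cells with NaN values: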

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

Again, the structure is now correct, but the cell values no longer show the project-specific data. What am I missing?

2 answers:

Answer 0 (score: 1)

IIUC, you just need swaplevel and sort_index:

#pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')

pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1)

Out[220]: 
Year                  2018                                                \
Month                  Apr                Feb              Jan             
                    Health      Name   Health      Name Health      Name   
Type        index                                                          
Development 0                              OK  ProjectD                    
            1                         Warning  ProjectF                    
            2                         Warning  ProjectI                    
Marketing   0                                               OK  ProjectA   
            1                                               OK  ProjectE   
Science     0      Warning  ProjectB                                       
            1      Warning  ProjectK                                       
Year                                                   
Month                Mar                May            
                  Health      Name   Health      Name  
Type        index                                      
Development 0                       Trouble  ProjectG  
            1                                          
            2                                          
Marketing   0         OK  ProjectC  Trouble  ProjectH  
            1                            OK  ProjectJ  
Science     0                                          
            1                                          

#pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1).to_excel('aaaaaa.xlsx')
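
Note that in the output above sort_index orders the months alphabetically (Apr, Feb, Jan, ...). The following is a minimal sketch, not part of the original answer, of one way to get calendar order after the swap. The reduced four-row dataset and the names calendar, present and target are assumptions made for illustration only.

```python
import pandas as pd

# Reduced version of the question's data (illustrative only).
df = pd.DataFrame({
    'Type':   ['Marketing', 'Science', 'Marketing', 'Development'],
    'Name':   ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD'],
    'Health': ['OK', 'Warning', 'OK', 'OK'],
    'Month':  ['Jan', 'Apr', 'Mar', 'Feb'],
    'Year':   [2018, 2018, 2018, 2018],
})
df['index'] = df.groupby(['Year', 'Month', 'Type']).cumcount()

pv2 = pd.pivot_table(
    df, values=['Name', 'Health'], index=['Type', 'index'],
    columns=['Year', 'Month'],
    aggfunc={'Name': lambda x: "|".join(x), 'Health': lambda x: ":".join(x)},
).fillna('')

# Move the Health/Name level to the bottom, then sort the column labels.
swapped = pv2.swaplevel(0, 1, axis=1).swaplevel(1, 2, axis=1).sort_index(axis=1)
months_sorted = swapped.columns.get_level_values('Month').unique().tolist()
print(months_sorted)  # alphabetical, not chronological

# Build the desired column order explicitly and reindex against it.
calendar = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
            'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
present = [m for m in calendar if m in months_sorted]
years = swapped.columns.get_level_values('Year').unique()
target = pd.MultiIndex.from_product(
    [years, present, ['Health', 'Name']], names=swapped.columns.names)
ordered = swapped.reindex(columns=target)
print(ordered)
```

Because the levels of target are already in (Year, Month, field) order, every requested label exists in swapped and no cell is lost.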


Answer 1 (score: 1)

pv2 starts out with its columns in this order:

In [35]: pv2.columns.tolist()
Out[35]: 
[('Health', 2018, 'Apr'),
 ('Health', 2018, 'Feb'),
 ('Health', 2018, 'Jan'),
 ('Health', 2018, 'Mar'),
 ('Health', 2018, 'May'),
 ('Name', 2018, 'Apr'),
 ('Name', 2018, 'Feb'),
 ('Name', 2018, 'Jan'),
 ('Name', 2018, 'Mar'),
 ('Name', 2018, 'May')]

We want to rearrange the columns into this order:

In [36]: list(zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
Out[36]: 
[(2018, 'Jan', 'Health'),
 (2018, 'Jan', 'Name'),
 (2018, 'Feb', 'Health'),
 (2018, 'Feb', 'Name'),
 (2018, 'Mar', 'Health'),
 (2018, 'Mar', 'Name'),
 (2018, 'Apr', 'Health'),
 (2018, 'Apr', 'Name'),
 (2018, 'May', 'Health'),
 (2018, 'May', 'Name')]

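Each column is represented by a 3-tuple. reindex can reorder the list of columns, but it cannot change the internal order of the items within each 3-tuple. For that, use reorder_levels: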

In [37]: pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
In [38]: pv2.columns.tolist()
Out[38]: 
[(2018, 'Apr', 'Health'),
 (2018, 'Feb', 'Health'),
 (2018, 'Jan', 'Health'),
 (2018, 'Mar', 'Health'),
 (2018, 'May', 'Health'),
 (2018, 'Apr', 'Name'),
 (2018, 'Feb', 'Name'),
 (2018, 'Jan', 'Name'),
 (2018, 'Mar', 'Name'),
 (2018, 'May', 'Name')]
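
To see why the question's reindex produced only NaN, here is a small sketch on a hypothetical one-row frame (the names small, wanted, broken and fixed are illustrative, not from the original post). reindex looks up each requested tuple as a whole label, so tuples whose items are in a different level order match nothing.

```python
import pandas as pd

# One-row frame whose columns are (field, Year, Month) tuples, like pv2.
cols = pd.MultiIndex.from_tuples(
    [('Health', 2018, 'Feb'), ('Health', 2018, 'Jan'),
     ('Name', 2018, 'Feb'), ('Name', 2018, 'Jan')],
    names=[None, 'Year', 'Month'])
small = pd.DataFrame([['OK', 'Warning', 'ProjectD', 'ProjectA']], columns=cols)

# Desired labels, written as (Year, Month, field).
wanted = [(2018, 'Jan', 'Health'), (2018, 'Jan', 'Name'),
          (2018, 'Feb', 'Health'), (2018, 'Feb', 'Name')]

# None of these tuples is an existing column label, so every
# requested column is created empty (NaN).
broken = small.reindex(columns=wanted)
print(broken.isna().all().all())

# Reordering the levels first makes the labels match, and the data survives.
fixed = small.reorder_levels([1, 2, 0], axis=1).reindex(columns=wanted)
print(fixed)
```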

Once the levels are in the desired order, you can call reindex to reorder the columns (to get the months in order).

import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})

df['index'] = df.groupby(['Year','Month','Type']).cumcount()

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], 
                     columns=['Year', 'Month'], 
                     aggfunc={'Name':lambda x: "|".join(x), 
                              'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

print(pv2)

yields

Year                2018                                                                                    
Month                Jan                Feb              Mar                Apr                May          
                  Health      Name   Health      Name Health      Name   Health      Name   Health      Name
Type        index                                                                                           
Development 0                            OK  ProjectD                                      Trouble  ProjectG
            1                       Warning  ProjectF                                                       
            2                       Warning  ProjectI                                                       
Marketing   0         OK  ProjectA                        OK  ProjectC                     Trouble  ProjectH
            1         OK  ProjectE                                                              OK  ProjectJ
Science     0                                                           Warning  ProjectB                   
            1                                                           Warning  ProjectK                   
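
A side note on the aggregation itself, offered as a sketch rather than as part of the original answer: because the cumcount-based index column makes every (Type, index, Year, Month) combination unique, each cell receives at most one string, so the join separators are never actually used. Under that assumption, the built-in 'first' aggregation should give the same table. The reduced dataset below is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Type':   ['Marketing', 'Marketing', 'Development', 'Science'],
    'Name':   ['ProjectA', 'ProjectE', 'ProjectD', 'ProjectB'],
    'Health': ['OK', 'OK', 'OK', 'Warning'],
    'Month':  ['Jan', 'Jan', 'Feb', 'Apr'],
    'Year':   [2018, 2018, 2018, 2018],
})
df['index'] = df.groupby(['Year', 'Month', 'Type']).cumcount()

# The running count guarantees one row per pivot cell.
unique_cells = not df.duplicated(['Type', 'index', 'Year', 'Month']).any()

kwargs = dict(values=['Name', 'Health'], index=['Type', 'index'],
              columns=['Year', 'Month'])
joined = pd.pivot_table(
    df, aggfunc={'Name': lambda x: "|".join(x),
                 'Health': lambda x: ":".join(x)},
    **kwargs).fillna('').sort_index(axis=1)
first = pd.pivot_table(df, aggfunc='first', **kwargs).fillna('').sort_index(axis=1)

# Compare labels and cell contents of the two results.
same = (list(joined.columns) == list(first.columns)
        and list(joined.index) == list(first.index)
        and np.array_equal(joined.to_numpy(dtype=object),
                           first.to_numpy(dtype=object)))
print(unique_cells, same)
```

If several rows could land in the same cell (for example, without the index column), the join lambdas would still be needed to keep every value.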

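Although you may sometimes need to specify the desired column order manually, this is not (necessarily) one of those cases. The order you want is the natural date order. It is therefore to our advantage to parse the Year and Month labels as actual dates (dtype datetime64[ns]). This unlocks Pandas' intelligent datetime handling.

  • For example, pivot_table automatically sorts the dates for us if we use a date column (i.e. a column of dtype datetime64[ns]).

  • Moreover, we can conveniently generate all the calendar months in order, without typing the dates by hand: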

    dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
    
  • We can easily convert a DatetimeIndex to a 2-level Year/Month MultiIndex (for presentation purposes):

    pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
    

For example,

import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})

df['Date'] = pd.to_datetime(df['Year'].astype('str')+df['Month'], format='%Y%b')
df['index'] = df.groupby(['Date','Type']).cumcount()

pv2 = pd.pivot_table(df, values=['Name', 'Health'], columns=['Type', 'index'], 
                     index=['Date'], 
                     aggfunc={'Name':lambda x: "|".join(x), 
                              'Health':lambda x: ":".join(x), }).fillna('')

dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
pv2 = pv2.reindex(dates, fill_value='')
pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
pv2 = pv2.stack(0)
pv2 = pv2.T
print(pv2)

yields

                    2018                                                                                     ...                                                             
                     Jan                Feb              Mar                Apr                May           ...     Aug         Sep         Oct         Nov         Dec     
                  Health      Name   Health      Name Health      Name   Health      Name   Health      Name ...  Health Name Health Name Health Name Health Name Health Name
Type        index                                                                                            ...                                                             
Development 0                            OK  ProjectD                                      Trouble  ProjectG ...                                                             
            1                       Warning  ProjectF                                                        ...                                                             
            2                       Warning  ProjectI                                                        ...                                                             
Marketing   0         OK  ProjectA                        OK  ProjectC                     Trouble  ProjectH ...                                                             
            1         OK  ProjectE                                                              OK  ProjectJ ...                                                             
Science     0                                                           Warning  ProjectB                    ...                                                             
            1                                                           Warning  ProjectK                    ...
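
The output above covers the whole of 2018, including months with no projects. If only the span actually covered by the data is wanted, one possible variation (a sketch, not from the original answer) is to derive the date range from the data itself. It assumes an English locale for parsing %b month abbreviations and uses a reduced, illustrative dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'Type':  ['Marketing', 'Development', 'Marketing'],
    'Name':  ['ProjectA', 'ProjectD', 'ProjectH'],
    'Month': ['Jan', 'Feb', 'May'],
    'Year':  [2018, 2018, 2018],
})
# Parse Year + Month into real dates (first day of each month).
df['Date'] = pd.to_datetime(df['Year'].astype(str) + df['Month'], format='%Y%b')

pv = pd.pivot_table(df, values='Name', index='Date', columns='Type',
                    aggfunc=lambda x: "|".join(x)).fillna('')

# Month starts from the earliest to the latest date present in the data.
dates = pd.date_range(df['Date'].min(), df['Date'].max(), freq='MS')

# Missing months (here Mar and Apr) become rows of empty strings.
pv = pv.reindex(dates, fill_value='')
pv.index = pv.index.strftime('%Y-%b')
print(pv)
```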