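Pandas - Breaking a dataframe's rows into multiple rows

Time: 2017-10-11 21:04:50

Tags: python-2.7 pandas data-cleaning

This is a project I've been trying to finish for the past few days. We're looking for a better way to get financial data into our dashboards, but the software we use exports it in an awful layout that can't be fed into any kind of program, because it's meant for a person to glance at and get the general idea.

I'd appreciate advice on how to code this properly, or to be told if I'm going about it in a crazy way. This data has already been heavily cleaned, so let me know if anything looks seriously wrong: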

                 Expense Categories Jan Actual Jan Budget Feb Actual  \
3    5600 Direct Personnel Expenses    2521.73          0    -290.57   
4           6000 Automobile Expense     909.33       1314     483.15   
5         6160 Funeral Home Expense       1072    1800.02          0   
6                  6400 Lab Expense          0          0      65.18   
9        6100 Marketing & Promotion     543.13    1850.01    1158.41   

Also, while cleaning I pulled out some variables, for example:

department = "PR"
direct_indirect = {'5600 Direct Personnel Expenses': 'Direct Expense'}  # etc.

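My end goal is to include a budget summary in the dashboard I'm designing for each department in Tableau, so I believe the best result would look like this: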

Expense Category  Direct/Indirect  Department   Month-Year  Actual  Budget
6400 Lab Expense    Direct Expense   PR          jan 2016     0       0
6400 Lab Expense    Direct Expense   PR          feb 2016     0       0
6400 Lab Expense    Direct Expense   PR          mar 2016     0       0
6400 Lab Expense    Direct Expense   PR          apr 2016     0       0
6400 Lab Expense    Direct Expense   PR          may 2016     0       0

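I'm struggling to work out how to do this. I'm not at all sure how to create multiple rows in a new dataframe for each expense type, where every two columns represent a new month. I feel like the only way is to use: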

for index, row in df1.iterrows():

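But I get lost on how to iterate over each column and then assign the values to a new dataframe.

Let me know if I've left out any details you need. Thanks for your help.

Andy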

2 Answers:

Answer 0: (score: 2)

Use melt and pivot_table:

df=df.melt('Expense Categories')
df[['Month','Type']]=df.variable.str.split(' ',expand=True)
df=pd.pivot_table(df,index=['Expense Categories','Month'],columns='Type',values='value').reset_index()
df

Out[1176]: 
Type              Expense Categories Month   Actual   Budget
0     5600 Direct Personnel Expenses   Feb  -290.57      NaN
1     5600 Direct Personnel Expenses   Jan  2521.73     0.00
2            6000 Automobile Expense   Feb   483.15      NaN
3            6000 Automobile Expense   Jan   909.33  1314.00
4         6100 Marketing & Promotion   Feb  1158.41      NaN
5         6100 Marketing & Promotion   Jan   543.13  1850.01
6          6160 Funeral Home Expense   Feb     0.00      NaN
7          6160 Funeral Home Expense   Jan  1072.00  1800.02
8                   6400 Lab Expense   Feb    65.18      NaN
9                   6400 Lab Expense   Jan     0.00     0.00
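
To see what each step contributes, here is a minimal sketch of the same melt and pivot_table sequence on two of the question's rows. The variable names wide, melted and tidy are illustrative and do not come from the answer.

```python
import pandas as pd

# Two rows copied from the question's table, kept small for readability.
wide = pd.DataFrame({
    'Expense Categories': ['6000 Automobile Expense', '6400 Lab Expense'],
    'Jan Actual': [909.33, 0.0],
    'Jan Budget': [1314.0, 0.0],
    'Feb Actual': [483.15, 65.18],
})

# melt: one row per (category, original column) pair,
# with the old column label in 'variable' and the number in 'value'.
melted = wide.melt('Expense Categories')

# Split a label such as "Jan Actual" into Month="Jan" and Type="Actual".
melted[['Month', 'Type']] = melted['variable'].str.split(' ', expand=True)

# pivot_table: spread Type back out into separate Actual / Budget columns.
tidy = pd.pivot_table(
    melted,
    index=['Expense Categories', 'Month'],
    columns='Type',
    values='value',
).reset_index()

print(melted)
print(tidy)
```

Because there is no "Feb Budget" column in the input, the Budget value for the Feb rows comes out as NaN, which matches the output shown above.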

We are almost there:

df['department']='PR'
df['Direct/Indirect'] = 'Direct Expense'
df['Month-Year'] = df['Month'] + str(2016)
df
Out[1182]: 
Type              Expense Categories Month   Actual   Budget department  \
0     5600 Direct Personnel Expenses   Feb  -290.57      NaN         PR   
1     5600 Direct Personnel Expenses   Jan  2521.73     0.00         PR   
2            6000 Automobile Expense   Feb   483.15      NaN         PR   
3            6000 Automobile Expense   Jan   909.33  1314.00         PR   
4         6100 Marketing & Promotion   Feb  1158.41      NaN         PR   
5         6100 Marketing & Promotion   Jan   543.13  1850.01         PR   
6          6160 Funeral Home Expense   Feb     0.00      NaN         PR   
7          6160 Funeral Home Expense   Jan  1072.00  1800.02         PR   
8                   6400 Lab Expense   Feb    65.18      NaN         PR   
9                   6400 Lab Expense   Jan     0.00     0.00         PR   
Type Direct/Indirect Month-Year  
0     Direct Expense    Feb2016  
1     Direct Expense    Jan2016  
2     Direct Expense    Feb2016  
3     Direct Expense    Jan2016  
4     Direct Expense    Feb2016  
5     Direct Expense    Jan2016  
6     Direct Expense    Feb2016  
7     Direct Expense    Jan2016  
8     Direct Expense    Feb2016  
9     Direct Expense    Jan2016  
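
The snippet above hard-codes 'Direct Expense' for every row. If the direct_indirect dictionary from the question is available, Series.map could fill that column per category instead. This is a rough sketch, assuming the dictionary keys match the category labels exactly. The small frame is only a stand-in for the reshaped result, and the dictionary holds just the two entries visible in the question, so the third category is deliberately left unmapped.

```python
import pandas as pd

# Stand-in for the reshaped frame (three rows only).
df = pd.DataFrame({
    'Expense Categories': [
        '5600 Direct Personnel Expenses',
        '6400 Lab Expense',
        '6000 Automobile Expense',
    ],
    'Month': ['Jan', 'Jan', 'Jan'],
})

# Only these two mappings appear in the question (the second one comes
# from its desired-output table); the real dictionary is longer.
direct_indirect = {
    '5600 Direct Personnel Expenses': 'Direct Expense',
    '6400 Lab Expense': 'Direct Expense',
}

# Look up each category; categories missing from the dict become NaN.
df['Direct/Indirect'] = df['Expense Categories'].map(direct_indirect)

# Lower-case month plus a space, as in the question's target ("jan 2016").
df['Month-Year'] = df['Month'].str.lower() + ' 2016'

# List categories that still need an entry in the dictionary.
missing = df.loc[df['Direct/Indirect'].isna(), 'Expense Categories'].unique()
print(df)
print(missing)
```

Checking for unmapped categories before exporting helps catch labels that were renamed or mistyped during cleaning.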

Answer 1: (score: 1)

You can reshape the dataframe using df.columns.str.split and stack:

import sys
import pandas as pd

df = pd.DataFrame({'Expense Categories': ['5600 Direct Personnel Expenses', '6000 Automobile Expense', '6160 Funeral Home Expense', '6400 Lab Expense', '6100 Marketing & Promotion'], 'Feb Actual': [-290.57, 483.15, 0.0, 65.18, 1158.41], 'Jan Actual': [2521.73, 909.33, 1072.0, 0.0, 543.13], 'Jan Budget': [0.0, 1314.0, 1800.02, 0.0, 1850.01]})

df = df.set_index('Expense Categories')
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month-Year',None]
df = df.stack('Month-Year')
df = df.reset_index()
df['Direct/Indirect'] = 'Direct Expense'
df['Department'] = 'PR'
df['Month-Year'] = df['Month-Year'] + ' 2016'

with pd.option_context('display.width', sys.maxsize):
    print(df)

yields

               Expense Categories Month-Year   Actual   Budget Direct/Indirect Department
0  5600 Direct Personnel Expenses   Feb 2016  -290.57      NaN  Direct Expense         PR
1  5600 Direct Personnel Expenses   Jan 2016  2521.73     0.00  Direct Expense         PR
2         6000 Automobile Expense   Feb 2016   483.15      NaN  Direct Expense         PR
3         6000 Automobile Expense   Jan 2016   909.33  1314.00  Direct Expense         PR
4       6160 Funeral Home Expense   Feb 2016     0.00      NaN  Direct Expense         PR
5       6160 Funeral Home Expense   Jan 2016  1072.00  1800.02  Direct Expense         PR
6                6400 Lab Expense   Feb 2016    65.18      NaN  Direct Expense         PR
7                6400 Lab Expense   Jan 2016     0.00     0.00  Direct Expense         PR
8      6100 Marketing & Promotion   Feb 2016  1158.41      NaN  Direct Expense         PR
9      6100 Marketing & Promotion   Jan 2016   543.13  1850.01  Direct Expense         PR

Explanation

df = df.set_index('Expense Categories')
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month-Year',None]

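These lines create a MultiIndex for the column index, separating the month from the Actual/Budget part of each column label. set_index is used here to hide the Expense Categories column from the str.split operation. At this point df looks like this: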

Month-Year                          Feb      Jan         
                                 Actual   Actual   Budget
Expense Categories                                       
5600 Direct Personnel Expenses  -290.57  2521.73     0.00
6000 Automobile Expense          483.15   909.33  1314.00
6160 Funeral Home Expense          0.00  1072.00  1800.02
6400 Lab Expense                  65.18     0.00     0.00
6100 Marketing & Promotion      1158.41   543.13  1850.01
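
As a quick check of how the split produces a two-level column index, the following sketch runs str.split(expand=True) on a standalone Index holding the same three labels. The names cols and multi are illustrative only.

```python
import pandas as pd

# The three data column labels left after set_index('Expense Categories').
cols = pd.Index(['Feb Actual', 'Jan Actual', 'Jan Budget'])

# On an Index, str.split with expand=True returns a MultiIndex:
# one level per whitespace-separated piece of the label.
multi = cols.str.split(expand=True)

# Name the first level as the answer does; the second level stays unnamed.
multi.names = ['Month-Year', None]

print(type(multi).__name__)
print(multi.tolist())
```

Each label becomes a (month, type) tuple, which is what lets stack later pull the month level out on its own.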

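Now we can use stack to move the Jan/Feb labels (or, more precisely, the 'Month-Year' level of the column index) out of the columns and into the row index: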

df = df.stack('Month-Year')

yields

                                            Actual   Budget
Expense Categories             Month-Year                  
5600 Direct Personnel Expenses Feb         -290.57      NaN
                               Jan         2521.73     0.00
6000 Automobile Expense        Feb          483.15      NaN
                               Jan          909.33  1314.00
6160 Funeral Home Expense      Feb            0.00      NaN
                               Jan         1072.00  1800.02
6400 Lab Expense               Feb           65.18      NaN
                               Jan            0.00     0.00
6100 Marketing & Promotion     Feb         1158.41      NaN
                               Jan          543.13  1850.01
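
To finish from this point, reset_index turns both index levels back into ordinary columns, as in the full listing. The sketch below also adds one optional step that is not part of the answer: parsing the month labels with pd.to_datetime so that rows sort in calendar order (Jan before Feb) rather than alphabetically. The Period column name and the two-row sample frame are assumptions made for illustration.

```python
import pandas as pd

# Two categories from the question, with the two-level columns built directly.
columns = pd.MultiIndex.from_tuples(
    [('Feb', 'Actual'), ('Jan', 'Actual'), ('Jan', 'Budget')],
    names=['Month-Year', None],
)
wide = pd.DataFrame(
    [[483.15, 909.33, 1314.0], [65.18, 0.0, 0.0]],
    index=pd.Index(['6000 Automobile Expense', '6400 Lab Expense'],
                   name='Expense Categories'),
    columns=columns,
)

# stack moves the month level into the row index; reset_index then
# converts both index levels into regular columns.
tidy = wide.stack('Month-Year').reset_index()
tidy['Month-Year'] = tidy['Month-Year'] + ' 2016'

# Optional: parse "Jan 2016" into a timestamp so months sort by calendar.
tidy['Period'] = pd.to_datetime(tidy['Month-Year'], format='%b %Y')
tidy = tidy.sort_values(['Expense Categories', 'Period']).reset_index(drop=True)

print(tidy)
```

A real datetime column may also be convenient for a dashboard tool, since it can be treated as a date axis instead of free text.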