How do I pivot a pandas DataFrame and then add hierarchical columns?

Asked: 2018-07-14 21:43:45

Tags: python pandas data-cleaning preprocessor dataframe

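Can someone help me understand the steps for converting a Python pandas DataFrame in record form (Dataset A) into one pivoted around nested columns (as shown in Dataset B)?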

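For this question, the underlying schema follows these rules:

  • Each ProjectID appears once
  • Each ProjectID is associated with a single PM
  • Each ProjectID is associated with a single Category
  • Multiple ProjectIDs can be associated with a single Category
  • Multiple ProjectIDs can be associated with a single PM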

Input Dataset A

import pandas as pd

df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
          'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
          'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
          'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
          'Score':[10,7,10,5,15,10,0,2]})


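Desired output (Dataset B)

Note that a nested index has been added across the columns. Also note that "Comments" and "Score" both appear at the same level beneath "ProjectID". Finally, note that the desired output does not aggregate any data; it groups/merges the category data into one row per Category value.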

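So far I have tried:

  • df_A.set_index(['Category','ProjectID'],append=True).unstack() - This only works if I first create a nested index of ['Category','ProjectID'] and append it to the original numeric index created with a standard DataFrame. However, it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
  • df_A.groupby() - I was not able to use this because it seems to force some form of aggregation in order to get all the values of a single category onto a single row.
  • df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation, and the result starts to look similar to my intended output, but it only shows the "Comments" field and does not let me set nested columns. I receive an error when I try to set values=['Comments','Score'] in the pivot statement.

I think the answer lies somewhere between pivot, unstack, set_index, and groupby, but I don't know how to complete the pivot and then add the appropriate nested column index.

Thanks for any ideas.
Updated the question based on Mr. T's comments. Thank you.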
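For illustration, here is a minimal sketch of how the first and third attempts behave. It assumes a recent pandas version, where pivot takes keyword arguments and accepts a list for values (the error described above may be specific to the pandas release in use at the time). The variable names with_append, without_append, and pivoted are illustrative only.

```python
import pandas as pd

df_A = pd.DataFrame({
    'ProjectID': [1, 2, 3, 4, 5, 6, 7, 8],
    'PM': ['Bob', 'Jill', 'Jack', 'Jack', 'Jill', 'Amy', 'Jill', 'Jack'],
    'Category': ['Category A', 'Category B', 'Category C', 'Category B',
                 'Category A', 'Category D', 'Category B', 'Category B'],
    'Comments': ['Justification %d' % i for i in range(1, 9)],
    'Score': [10, 7, 10, 5, 15, 10, 0, 2],
})

# append=True keeps the original 0..7 index as an extra row level,
# so every record stays on its own row after unstacking ProjectID.
with_append = df_A.set_index(['Category', 'ProjectID'], append=True).unstack()

# Without append=True the old index is dropped and the rows
# collapse to one per Category.
without_append = df_A.set_index(['Category', 'ProjectID']).unstack()

# Assumption: a recent pandas, where a list of value columns is accepted.
pivoted = df_A.pivot(index='Category', columns='ProjectID',
                     values=['Comments', 'Score'])

print(with_append.shape[0], without_append.shape[0], pivoted.shape)
```

In this sketch the difference between the first two results comes only from whether the original index is kept, which matches the repeated rows described in the first bullet.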

1 Answer:

Answer 0: (score: 0)

I think this is what you want:

pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)

Out[4]:
PM                        Amy                    Bob        ...              Jill
ProjectID                   6                      1        ...                 5                      7
                     Comments Score         Comments Score  ...          Comments Score         Comments Score
  Category                                                  ...
0 Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
  Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
  Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
  Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN

[4 rows x 16 columns]
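A similar layout can probably also be reached with set_index and unstack, which avoids the leading 0 row level. This is only a sketch under the assumption of a recent pandas version, not the method used above; the name wide is illustrative. One difference to be aware of: Score is converted to float here because the missing cells are filled with NaN.

```python
import pandas as pd

df_A = pd.DataFrame({
    'ProjectID': [1, 2, 3, 4, 5, 6, 7, 8],
    'PM': ['Bob', 'Jill', 'Jack', 'Jack', 'Jill', 'Amy', 'Jill', 'Jack'],
    'Category': ['Category A', 'Category B', 'Category C', 'Category B',
                 'Category A', 'Category D', 'Category B', 'Category B'],
    'Comments': ['Justification %d' % i for i in range(1, 9)],
    'Score': [10, 7, 10, 5, 15, 10, 0, 2],
})

wide = (
    df_A.set_index(['Category', 'PM', 'ProjectID'])  # Category stays as the row index
        .unstack(['PM', 'ProjectID'])                # move PM and ProjectID into the columns
        .reorder_levels([1, 2, 0], axis=1)           # column order: PM, ProjectID, field name
        .sort_index(axis=1)                          # group each project's columns together
)

print(wide)
```

No aggregation takes place: each (Category, PM, ProjectID) combination occurs once, so unstack only rearranges the existing values.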

Edit: To select rows by category, you should get rid of the row index 0 by appending .xs(0):

In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)

In [4]: df_A_transformed
Out[4]:
PM                      Amy                    Bob        ...              Jill
ProjectID                 6                      1        ...                 5                      7
                   Comments Score         Comments Score  ...          Comments Score         Comments Score
Category                                                  ...
Category A              NaN   NaN  Justification 1    10  ...   Justification 5    15              NaN   NaN
Category B              NaN   NaN              NaN   NaN  ...               NaN   NaN  Justification 7     0
Category C              NaN   NaN              NaN   NaN  ...               NaN   NaN              NaN   NaN
Category D  Justification 6    10              NaN   NaN  ...               NaN   NaN              NaN   NaN

[4 rows x 16 columns]

In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM    ProjectID
Amy   6          Comments                NaN
                 Score                   NaN
Bob   1          Comments                NaN
                 Score                   NaN
Jack  3          Comments                NaN
                 Score                   NaN
      4          Comments    Justification 4
                 Score                     5
      8          Comments    Justification 8
                 Score                     2
Jill  2          Comments    Justification 2
                 Score                     7
      5          Comments                NaN
                 Score                   NaN
      7          Comments    Justification 7
                 Score                     0
Name: Category B, dtype: object
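Because most cells in a single category row are NaN, it can help to drop them after selecting, or to slice by a column level instead. The sketch below assumes a recent pandas version and rebuilds the wide frame with set_index/unstack as a stand-in for df_A_transformed; the names wide, cat_b, and jack are illustrative.

```python
import pandas as pd

df_A = pd.DataFrame({
    'ProjectID': [1, 2, 3, 4, 5, 6, 7, 8],
    'PM': ['Bob', 'Jill', 'Jack', 'Jack', 'Jill', 'Amy', 'Jill', 'Jack'],
    'Category': ['Category A', 'Category B', 'Category C', 'Category B',
                 'Category A', 'Category D', 'Category B', 'Category B'],
    'Comments': ['Justification %d' % i for i in range(1, 9)],
    'Score': [10, 7, 10, 5, 15, 10, 0, 2],
})

# Stand-in for df_A_transformed (assumption: same layout, built differently).
wide = (
    df_A.set_index(['Category', 'PM', 'ProjectID'])
        .unstack(['PM', 'ProjectID'])
        .reorder_levels([1, 2, 0], axis=1)
        .sort_index(axis=1)
)

# One category as a Series, keeping only the projects that belong to it.
cat_b = wide.loc['Category B'].dropna()
print(cat_b)

# All columns for a single PM, across every category.
jack = wide.xs('Jack', axis=1, level='PM')
print(jack)
```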