Pandas - How to group by and unstack multiple variables?

Time: 2016-05-09 17:03:45

Tags: python pandas dataframe

I currently have some datasets structured as follows:

import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data, columns=['participant', 'step_name', 'title', 'colour', 'class'])

which looks like this:

+----+---------------+-------------+----------------+----------+---------+
|    |   participant | step_name   | title          | colour   | class   |
|----+---------------+-------------+----------------+----------+---------|
|  0 |           100 | first       | acceptable     | blue     | A       |
|  1 |           101 | first       | acceptable     | blue     | B       |
|  2 |           102 | second      | not acceptable | blue     | B       |
|  3 |           103 | third       | acceptable     | green    | A       |
|  4 |           104 | second      | not acceptable | green    | B       |
|  5 |           105 | first       | acceptable     | blue     | A       |
|  6 |           106 | first       | not acceptable | green    | A       |
|  7 |           107 | first       | acceptable     | blue     | A       |
|  8 |           108 | second      | acceptable     | blue     | A       |
|  9 |           109 | third       | acceptable     | green    | B       |
+----+---------------+-------------+----------------+----------+---------+

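Now I want to aggregate the dataset so that each row counts the occurrences of each repeated value. So far I have managed to do this for two variables (step_name and title), as follows: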

count_df = df[['participant', 'step_name', 'title']].groupby(['step_name', 'title']).count()
count_df = count_df.unstack()
count_df.fillna(0, inplace=True)
count_df.columns = count_df.columns.get_level_values(1)
count_df

+--------+--------------+------------------+
|        |   acceptable |   not acceptable |
|--------+--------------+------------------|
| first  |            4 |                1 |
| second |            1 |                2 |
| third  |            2 |                0 |
+--------+--------------+------------------+

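Now I would like an additional set of columns containing the values of the other variables (colour and class). Basically, I want to group by and then unstack on those variables as well, but I am not sure how to do that with more than two variables. Ultimately, I want my final table to look like this: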

+-------+--------+--------+--------------+------------------+
| class | colour | step   |   acceptable |   not acceptable |
|-------+--------+--------+--------------+------------------|
| A     | blue   | first  |            3 |                0 |
| B     | blue   | first  |            1 |                0 |
| A     | green  | first  |            0 |                1 |
| B     | green  | first  |            0 |                0 |
| A     | blue   | second |            1 |                0 |
| B     | blue   | second |            0 |                1 |
| A     | green  | second |            0 |                0 |
| B     | green  | second |            0 |                1 |
| A     | blue   | third  |            0 |                0 |
| B     | blue   | third  |            0 |                0 |
| A     | green  | third  |            1 |                0 |
| B     | green  | third  |            1 |                0 |
+-------+--------+--------+--------------+------------------+

How can I reshape my data so that it looks like my last example? Should I still be using the unstack and groupby functions?

2 answers:

Answer 0: (score: 6)

You can use pivot_table():

In [130]: df['count'] = 1

In [134]: (df.pivot_table(index=['class','colour','step_name'], columns='title',
   .....:                 values='count', aggfunc='sum', fill_value=0)
   .....:    .reset_index()
   .....: )
Out[134]:
title class colour step_name  acceptable  not acceptable
0         A   blue     first           3               0
1         A   blue    second           1               0
2         A  green     first           0               1
3         A  green     third           1               0
4         B   blue     first           1               0
5         B   blue    second           0               1
6         B  green    second           0               1
7         B  green     third           1               0
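
To address the question of whether groupby and unstack can still be used: the following is a minimal sketch, not taken from either answer, that groups on all four variables, counts rows with size(), and moves title into the columns. The variable name counts is chosen here for illustration only, and the sketch assumes a pandas version in which unstack accepts fill_value.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

# Count rows per (class, colour, step_name, title) combination,
# then pivot the 'title' level of the index into columns.
counts = (
    df.groupby(['class', 'colour', 'step_name', 'title'])
      .size()
      .unstack('title', fill_value=0)
      .reset_index()
)
counts.columns.name = None  # drop the leftover 'title' label on the column axis
print(counts)
```

Like pivot_table(), this only returns combinations of class, colour and step_name that actually occur in the data.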

Answer 1: (score: 6)

I think you need pivot_table with aggfunc=len, reset_index, and rename_axis (new in pandas 0.18.0):

df = df.pivot_table(index=['class','colour','step_name'], 
                    columns='title', 
                    aggfunc=len, 
                    values='participant', 
                    fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
      class colour step_name  acceptable  not acceptable
0         A   blue     first           3               0
1         A   blue    second           1               0
2         A  green     first           0               1
3         A  green     third           1               0
4         B   blue     first           1               0
5         B   blue    second           0               1
6         B  green    second           0               1
7         B  green     third           1               0
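
Note that the table requested in the question also lists combinations that never occur in the data (for example class B, green, first), while the pivot_table() output contains only the 8 observed combinations. One possible way to get those all-zero rows, sketched here as an assumption about what the asker wants rather than as part of either answer, is to reindex against the full product of the key values. The names keys, full_index and full are illustrative.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

keys = ['class', 'colour', 'step_name']

# Same pivot as in the answers: one row per observed key combination.
counts = df.pivot_table(index=keys, columns='title',
                        values='participant', aggfunc='count', fill_value=0)

# Build every possible (class, colour, step_name) combination.
full_index = pd.MultiIndex.from_product(
    [sorted(df[k].unique()) for k in keys], names=keys)

# Reindex so that unobserved combinations appear as rows of zeros.
full = counts.reindex(full_index, fill_value=0).astype(int).reset_index()
full.columns.name = None
print(full)
```

With two classes, two colours and three steps this gives 12 rows, the same number as in the desired table; the row order can be changed afterwards with sort_values if it needs to match.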