Question

I currently have a dataset structured as follows:

import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data, columns=['participant', 'step_name', 'title', 'colour', 'class'])

It looks like this:

+----+---------------+-------------+----------------+----------+---------+
|    |   participant | step_name   | title          | colour   | class   |
|----+---------------+-------------+----------------+----------+---------|
|  0 |           100 | first       | acceptable     | blue     | A       |
|  1 |           101 | first       | acceptable     | blue     | B       |
|  2 |           102 | second      | not acceptable | blue     | B       |
|  3 |           103 | third       | acceptable     | green    | A       |
|  4 |           104 | second      | not acceptable | green    | B       |
|  5 |           105 | first       | acceptable     | blue     | A       |
|  6 |           106 | first       | not acceptable | green    | A       |
|  7 |           107 | first       | acceptable     | blue     | A       |
|  8 |           108 | second      | acceptable     | blue     | A       |
|  9 |           109 | third       | acceptable     | green    | B       |
+----+---------------+-------------+----------------+----------+---------+

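Now I want to aggregate the dataset so that each row counts the occurrences of each combination of values. So far I have managed to do this for two variables (step_name and title), as follows: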

count_df = df[['participant', 'step_name', 'title']].groupby(['step_name', 'title']).count()
count_df = count_df.unstack()
count_df.fillna(0, inplace=True)
count_df.columns = count_df.columns.get_level_values(1)
count_df

+--------+--------------+------------------+
|        |   acceptable |   not acceptable |
|--------+--------------+------------------|
| first  |            4 |                1 |
| second |            1 |                2 |
| third  |            2 |                0 |
+--------+--------------+------------------+

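Now I would like an additional set of columns holding the values of the other variables (colour and class). Basically, I want to group by and then unstack on those variables, but I don't know how to do that with more than two variables. Ultimately, I want my final table to look like this: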

+-------+--------+--------+--------------+------------------+
| class | colour | step   |   acceptable |   not acceptable |
|-------+--------+--------+--------------+------------------|
| A     | blue   | first  |            3 |                0 |
| B     | blue   | first  |            1 |                0 |
| A     | green  | first  |            0 |                1 |
| B     | green  | first  |            0 |                0 |
| A     | blue   | second |            1 |                0 |
| B     | blue   | second |            0 |                1 |
| A     | green  | second |            0 |                0 |
| B     | green  | second |            0 |                1 |
| A     | blue   | third  |            0 |                0 |
| B     | blue   | third  |            0 |                0 |
| A     | green  | third  |            1 |                0 |
| B     | green  | third  |            1 |                0 |
+-------+--------+--------+--------------+------------------+

How can I reshape my data so that it looks like my last example? Should I still be using the unstack and groupby functions?

Answer 1

You can use pivot_table():

In [130]: df['count'] = 1

In [134]: (df.pivot_table(index=['class','colour','step_name'], columns='title',
   .....:                 values='count', aggfunc='sum', fill_value=0)
   .....:    .reset_index()
   .....: )
Out[134]:
title class colour step_name  acceptable  not acceptable
0         A   blue     first           3               0
1         A   blue    second           1               0
2         A  green     first           0               1
3         A  green     third           1               0
4         B   blue     first           1               0
5         B   blue    second           0               1
6         B  green    second           0               1
7         B  green     third           1               0
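
On the question of whether groupby and unstack still apply: the same counts can also be produced without adding a helper count column, by grouping on all four columns and unstacking title. This is a minimal sketch, assuming the df from the question; the variable name counts is only illustrative.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

counts = (
    df.groupby(['class', 'colour', 'step_name', 'title'])
      .size()                          # number of rows per combination
      .unstack('title', fill_value=0)  # one column per title value
      .reset_index()                   # turn the group keys back into columns
      .rename_axis(None, axis=1)       # drop the leftover 'title' columns name
)
print(counts)
```

Like the pivot_table() call, this only lists combinations of class, colour and step_name that actually occur in the data.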

Answer 2

I think you need pivot_table with aggfunc=len, reset_index, and rename_axis (new in pandas 0.18.0):

df = df.pivot_table(index=['class','colour','step_name'], 
                    columns='title', 
                    aggfunc=len, 
                    values='participant', 
                    fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
      class colour step_name  acceptable  not acceptable
0         A   blue     first           3               0
1         A   blue    second           1               0
2         A  green     first           0               1
3         A  green     third           1               0
4         B   blue     first           1               0
5         B   blue    second           0               1
6         B  green    second           0               1
7         B  green     third           1               0
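
Note that the output above contains only the 8 combinations that occur in the data, whereas the table requested in the question lists all 12, including all-zero rows such as B / green / first. One possible way to get those rows is to reindex the counts against the full product of the key values. The following is a sketch under that assumption; full_index and full are illustrative names, and it starts from the original df, not the pivoted one.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

keys = ['step_name', 'colour', 'class']

# Count rows per (keys, title) and spread title into columns.
counts = df.groupby(keys + ['title']).size().unstack('title', fill_value=0)

# Build every combination of the key values, observed or not.
full_index = pd.MultiIndex.from_product(
    [sorted(df[k].unique()) for k in keys], names=keys
)

# Reindex so that missing combinations appear as rows of zeros.
full = (
    counts.reindex(full_index, fill_value=0)
          .reset_index()
          .rename_axis(None, axis=1)
)
# Column order used in the question's desired table.
full = full[['class', 'colour', 'step_name', 'acceptable', 'not acceptable']]
print(full)
```

Putting step_name first in keys makes class vary fastest, which reproduces the row order of the desired table.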

Pandas - How to group by and unstack on multiple variables?
