I currently have a dataset structured as follows:
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second', 'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'not acceptable', 'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green', 'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data, columns=['participant', 'step_name', 'title', 'colour', 'class'])
which looks like this:
+----+---------------+-------------+----------------+----------+---------+
| | participant | step_name | title | colour | class |
|----+---------------+-------------+----------------+----------+---------|
| 0 | 100 | first | acceptable | blue | A |
| 1 | 101 | first | acceptable | blue | B |
| 2 | 102 | second | not acceptable | blue | B |
| 3 | 103 | third | acceptable | green | A |
| 4 | 104 | second | not acceptable | green | B |
| 5 | 105 | first | acceptable | blue | A |
| 6 | 106 | first | not acceptable | green | A |
| 7 | 107 | first | acceptable | blue | A |
| 8 | 108 | second | acceptable | blue | A |
| 9 | 109 | third | acceptable | green | B |
+----+---------------+-------------+----------------+----------+---------+
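Now I want to aggregate the dataset so that each row holds the number of occurrences of each combination of values. So far I have managed to do this for two variables (step_name and title), as follows: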
count_df = df[['participant', 'step_name', 'title']].groupby(['step_name', 'title']).count()
count_df = count_df.unstack()
count_df.fillna(0, inplace=True)
count_df.columns = count_df.columns.get_level_values(1)
count_df
+--------+--------------+------------------+
| | acceptable | not acceptable |
|--------+--------------+------------------|
| first | 4 | 1 |
| second | 1 | 2 |
| third | 2 | 0 |
+--------+--------------+------------------+
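Now I would like an additional set of columns containing the values of the other variables (colour and class). Essentially, I want to group by and then unstack those variables as well, but I don't know how to do that with more than two variables. Ultimately, I would like my final table to look like this: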
+-------+--------+--------+--------------+------------------+
| class | colour | step   |   acceptable |   not acceptable |
|-------+--------+--------+--------------+------------------|
| A     | blue   | first  |            3 |                0 |
| B     | blue   | first  |            1 |                0 |
| A     | green  | first  |            0 |                1 |
| B     | green  | first  |            0 |                0 |
| A     | blue   | second |            1 |                0 |
| B     | blue   | second |            0 |                1 |
| A     | green  | second |            0 |                0 |
| B     | green  | second |            0 |                1 |
| A     | blue   | third  |            0 |                0 |
| B     | blue   | third  |            0 |                0 |
| A     | green  | third  |            1 |                0 |
| B     | green  | third  |            1 |                0 |
+-------+--------+--------+--------------+------------------+
How can I reshape my data so that it looks like my last example? Should I still be working with unstack and groupby?
Answer 0 (score: 6)
You can use pivot_table():
In [130]: df['count'] = 1
In [134]: (df.pivot_table(index=['class','colour','step_name'], columns='title',
.....: values='count', aggfunc='sum', fill_value=0)
.....: .reset_index()
.....: )
Out[134]:
title class colour step_name acceptable not acceptable
0 A blue first 3 0
1 A blue second 1 0
2 A green first 0 1
3 A green third 1 0
4 B blue first 1 0
5 B blue second 0 1
6 B green second 0 1
7 B green third 1 0
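Note that the output above only contains combinations that actually occur in the data, while the table in the question also lists all-zero rows (for example class B, green, first). A minimal sketch of one way to get those rows, assuming the full grid of observed class, colour and step_name values is what is wanted: reindex the pivoted counts against the Cartesian product of those values. The names counts, full_index and full are illustrative only.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second',
                      'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable',
                  'not acceptable', 'acceptable', 'not acceptable',
                  'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green',
                   'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

keys = ['class', 'colour', 'step_name']

# Count participants per (class, colour, step_name) row and title column.
counts = df.pivot_table(index=keys, columns='title',
                        values='participant', aggfunc='count', fill_value=0)

# Every combination of the observed key values, including unseen ones.
full_index = pd.MultiIndex.from_product(
    [sorted(df[k].unique()) for k in keys], names=keys)

full = (counts.reindex(full_index, fill_value=0)  # add the missing rows as zeros
              .astype(int)
              .reset_index()
              .rename_axis(None, axis=1)          # drop the 'title' columns label
              .sort_values(['step_name', 'colour', 'class'])  # order used in the question
              .reset_index(drop=True))
print(full)
```

The final sort_values call is only there to reproduce the row order of the desired table; it can be dropped if the order does not matter.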
Answer 1 (score: 6)
I think you need pivot_table with aggfunc=len, reset_index and rename_axis (new in pandas 0.18.0):
df = df.pivot_table(index=['class', 'colour', 'step_name'],
                    columns='title',
                    aggfunc=len,
                    values='participant',
                    fill_value=0).reset_index().rename_axis(None, axis=1)
print(df)
class colour step_name acceptable not acceptable
0 A blue first 3 0
1 A blue second 1 0
2 A green first 0 1
3 A green third 1 0
4 B blue first 1 0
5 B blue second 0 1
6 B green second 0 1
7 B green third 1 0
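On the question of whether groupby and unstack can still be used: they can, since groupby accepts any number of keys and unstack can pivot a single named index level. The following is a sketch of that alternative rather than a replacement for the answers above; it groups on all four columns, counts rows with size(), and moves only the title level into columns. The variable name counts is illustrative.

```python
import pandas as pd

data = {'participant': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        'step_name': ['first', 'first', 'second', 'third', 'second',
                      'first', 'first', 'first', 'second', 'third'],
        'title': ['acceptable', 'acceptable', 'not acceptable', 'acceptable',
                  'not acceptable', 'acceptable', 'not acceptable',
                  'acceptable', 'acceptable', 'acceptable'],
        'colour': ['blue', 'blue', 'blue', 'green', 'green',
                   'blue', 'green', 'blue', 'blue', 'green'],
        'class': ['A', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B']}
df = pd.DataFrame(data)

counts = (df.groupby(['class', 'colour', 'step_name', 'title'])
            .size()                          # number of rows per combination
            .unstack('title', fill_value=0)  # one column per title value
            .reset_index()
            .rename_axis(None, axis=1))      # drop the 'title' columns label
print(counts)
```

As with pivot_table, this yields only the combinations present in the data, so it gives the same eight rows as the outputs shown above.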