I have a Pandas DataFrame like this:
+---+--------+-------------+------------------+
| | ItemID | Description | Feedback |
+---+--------+-------------+------------------+
| 0 | 8988 | Tall Chair | I hated it |
+---+--------+-------------+------------------+
| 1 | 8988 | Tall Chair | Best chair ever |
+---+--------+-------------+------------------+
| 2 | 6547 | Big Pillow | Soft and amazing |
+---+--------+-------------+------------------+
| 3 | 6547 | Big Pillow | Horrific color |
+---+--------+-------------+------------------+
I want to concatenate the values of the "Feedback" column into a new comma-separated column wherever the ItemID matches. Like this:
+---+--------+-------------+----------------------------------+
| | ItemID | Description | NewColumn |
+---+--------+-------------+----------------------------------+
| 0 | 8988 | Tall Chair | I hated it, Best chair ever |
+---+--------+-------------+----------------------------------+
| 1 | 6547 | Big Pillow | Soft and amazing, Horrific color |
+---+--------+-------------+----------------------------------+
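I have tried several different pivots, merges, stacks, and so on, and I am stuck.
I think NewColumn will end up being an array, but I am fairly new to Python, so I am not sure.
Eventually I also want to use this for text classification (generating some "Feedback" labels for a new "Description"; a multi-class problem).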
Answer 0 (score: 1)
Call .groupby('ItemID') on the DataFrame, then join the values of the Feedback column within each group:
df.groupby('ItemID')['Feedback'].apply(lambda x: ', '.join(x))
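To make the one-liner above easier to try, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the variable name `joined` is only illustrative. Note that the result is a Series indexed by ItemID, not a DataFrame with a NewColumn column.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ItemID": [8988, 8988, 6547, 6547],
    "Description": ["Tall Chair", "Tall Chair", "Big Pillow", "Big Pillow"],
    "Feedback": ["I hated it", "Best chair ever",
                 "Soft and amazing", "Horrific color"],
})

# Group rows by ItemID and join each group's feedback strings with ", ".
# Passing ", ".join directly is equivalent to lambda x: ', '.join(x).
joined = df.groupby("ItemID")["Feedback"].apply(", ".join)

print(joined)
```

By default groupby sorts the group keys, so 6547 appears before 8988 in the result.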
Answer 1 (score: 1)
I think you can groupby on ItemID and Description, apply join to the Feedback column, and finally call reset_index:
print(df.groupby(['ItemID', 'Description'])['Feedback'].apply(', '.join).reset_index(name='NewColumn'))
ItemID Description NewColumn
0 6547 Big Pillow Soft and amazing, Horrific color
1 8988 Tall Chair I hated it, Best chair ever
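The output above is sorted by ItemID, whereas the desired table in the question lists 8988 first. If the original row order matters, one option is to pass sort=False to groupby so that groups appear in order of first occurrence. This is a sketch under that assumption; the sample DataFrame is rebuilt from the question and the name `result` is illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ItemID": [8988, 8988, 6547, 6547],
    "Description": ["Tall Chair", "Tall Chair", "Big Pillow", "Big Pillow"],
    "Feedback": ["I hated it", "Best chair ever",
                 "Soft and amazing", "Horrific color"],
})

# sort=False keeps groups in the order their keys first appear in df,
# instead of sorting them by key value.
result = (
    df.groupby(["ItemID", "Description"], sort=False)["Feedback"]
      .apply(", ".join)
      .reset_index(name="NewColumn")
)

print(result)
```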
If you do not need the Description column:
print(df.groupby(['ItemID'])['Feedback'].apply(', '.join).reset_index(name='NewColumn'))
ItemID NewColumn
0 6547 Soft and amazing, Horrific color
1 8988 I hated it, Best chair ever
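The question also wonders whether NewColumn should end up as an array. Neither answer covers that, but if a list per item is more convenient for later processing (for example, as multiple labels per Description), a possible variation is to aggregate with list instead of joining into one string. This is only a sketch; the column name FeedbackList and the variable `feedback_lists` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ItemID": [8988, 8988, 6547, 6547],
    "Description": ["Tall Chair", "Tall Chair", "Big Pillow", "Big Pillow"],
    "Feedback": ["I hated it", "Best chair ever",
                 "Soft and amazing", "Horrific color"],
})

# Collect each group's feedback into a Python list rather than one string.
feedback_lists = (
    df.groupby(["ItemID", "Description"])["Feedback"]
      .agg(list)
      .reset_index(name="FeedbackList")
)

print(feedback_lists)
```

Keeping the values as a list avoids having to split the string again later, which would break if any feedback text itself contained a comma.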