Question

I have a Pandas DataFrame like this:

+---+--------+-------------+------------------+
|   | ItemID | Description | Feedback         |
+---+--------+-------------+------------------+
| 0 | 8988   | Tall Chair  | I hated it       |
+---+--------+-------------+------------------+
| 1 | 8988   | Tall Chair  | Best chair ever  |
+---+--------+-------------+------------------+
| 2 | 6547   | Big Pillow  | Soft and amazing |
+---+--------+-------------+------------------+
| 3 | 6547   | Big Pillow  | Horrific color   |
+---+--------+-------------+------------------+

I want to concatenate the values of the "Feedback" column into a new, comma-separated column wherever the ItemID matches. Like this:

+---+--------+-------------+----------------------------------+
|   | ItemID | Description | NewColumn                        |
+---+--------+-------------+----------------------------------+
| 0 | 8988   | Tall Chair  | I hated it, Best chair ever      |
+---+--------+-------------+----------------------------------+
| 1 | 6547   | Big Pillow  | Soft and amazing, Horrific color |
+---+--------+-------------+----------------------------------+

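I have tried several variations of pivot, merge, stack, and so on, and I am stuck. I think NewColumn will end up being an array, but I am fairly new to Python, so I am not sure.
Also, ultimately I am going to try to use this for text classification (for a new "Description", generate some "Feedback" labels [a multi-class problem]).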

Answer 1

Call .groupby('ItemID') on the DataFrame, then join the values of the Feedback column within each group:

df.groupby('ItemID')['Feedback'].apply(lambda x: ', '.join(x))

See "Pandas groupby: How to get a union of strings".
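
For anyone who wants to try the one-liner above end to end, here is a minimal sketch. The DataFrame is rebuilt by hand from the table in the question, and the variable name joined is only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ItemID": [8988, 8988, 6547, 6547],
    "Description": ["Tall Chair", "Tall Chair", "Big Pillow", "Big Pillow"],
    "Feedback": ["I hated it", "Best chair ever",
                 "Soft and amazing", "Horrific color"],
})

# Join each group's Feedback strings with ", ".
# The result is a Series indexed by ItemID, not a DataFrame.
joined = df.groupby("ItemID")["Feedback"].apply(", ".join)
print(joined)
```

Note that this gives a Series keyed by ItemID, so the Description column is not carried along.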

Answer 2

I think you can groupby ItemID and Description, apply join, and finally call reset_index:

print(df.groupby(['ItemID', 'Description'])['Feedback'].apply(', '.join).reset_index(name='NewColumn'))
   ItemID Description                         NewColumn
0    6547  Big Pillow  Soft and amazing, Horrific color
1    8988  Tall Chair       I hated it, Best chair ever

If you don't need the Description column:

print(df.groupby(['ItemID'])['Feedback'].apply(', '.join).reset_index(name='NewColumn'))
   ItemID                         NewColumn
0    6547  Soft and amazing, Horrific color
1    8988       I hated it, Best chair ever
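
The outputs above come back sorted by ItemID, whereas the desired table in the question lists 8988 first. A possible variation, sketched here as an assumption rather than as part of the original answer, is to pass sort=False so groups keep their order of first appearance. The second half shows collecting the feedback as Python lists instead of one joined string, which may be closer to the "array" the question mentions; the names result and as_lists are illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ItemID": [8988, 8988, 6547, 6547],
    "Description": ["Tall Chair", "Tall Chair", "Big Pillow", "Big Pillow"],
    "Feedback": ["I hated it", "Best chair ever",
                 "Soft and amazing", "Horrific color"],
})

# sort=False keeps groups in order of first appearance (8988, then 6547).
result = (
    df.groupby(["ItemID", "Description"], sort=False)["Feedback"]
      .apply(", ".join)
      .reset_index(name="NewColumn")
)
print(result)

# Alternative: keep each group's feedback as a list of separate strings.
as_lists = df.groupby("ItemID", sort=False)["Feedback"].apply(list)
print(as_lists)
```

Whether a joined string or a list is the better shape depends on what the later classification step expects.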

Row values to column array in Pandas DataFrame