Group by topic, then collapse a column of strings into the corresponding category

Time: 2018-09-24 20:11:46

Tags: python pandas dataframe nlp

Given:

import pandas as pd

lis1 = ('baseball', 'basketball', 'baseball', 'hockey', 'hockey', 'basketball')
lis2 = ('I had lots of fun', 'This was the most boring sport', 'I hit the ball hard', 'the puck went too fast', 'I scored a goal', 'the basket was broken')

# Keep a reference to the frame so it can be grouped later
df = pd.DataFrame({'topic': lis1, 'review': lis2})
print(df)

        topic                          review
0    baseball               I had lots of fun
1  basketball  This was the most boring sport
2    baseball             I hit the ball hard
3      hockey          the puck went too fast
4      hockey                 I scored a goal
5  basketball           the basket was broken

I need to turn this into the following pd.DataFrame:

lis1= ('baseball', 'basketball', 'hockey')
lis2= ("I had lots of fun, I hit the ball hard", "This was the most boring sport, the basket was broken","the puck went too fast I scored a goal")

pd.DataFrame({'topic':lis1, 'review':lis2})

        topic                                             review
0    baseball             I had lots of fun, I hit the ball hard
1  basketball  This was the most boring sport, the basket was...
2      hockey             the puck went too fast I scored a goal

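I'm confused because the column I want to aggregate holds strings, and I want to combine the strings within each group into one. The strings don't have to be separated by commas.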

1 Answer:

Answer 0: (score: 2)

Use groupby and aggregate the strings with str.join:

df.groupby('topic', as_index=False).agg({'review' : ', '.join})

        topic                                             review
0    baseball             I had lots of fun, I hit the ball hard
1  basketball  This was the most boring sport, the basket was...
2      hockey            the puck went too fast, I scored a goal

Alternatively, use groupby and call apply; the syntax is slightly different:

df.groupby('topic')['review'].apply(', '.join).reset_index()

        topic                                             review
0    baseball             I had lots of fun, I hit the ball hard
1  basketball  This was the most boring sport, the basket was...
2      hockey            the puck went too fast, I scored a goal
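
For completeness, here is a minimal self-contained sketch that rebuilds the sample frame from the question and runs both variants side by side. The variable names (topics, reviews, by_agg, by_apply, by_space) are illustrative and do not appear in the original post. The last variant uses a plain space as the separator, since the question says commas are not required.

```python
import pandas as pd

# Sample data copied from the question
topics = ('baseball', 'basketball', 'baseball', 'hockey', 'hockey', 'basketball')
reviews = ('I had lots of fun', 'This was the most boring sport',
           'I hit the ball hard', 'the puck went too fast',
           'I scored a goal', 'the basket was broken')
df = pd.DataFrame({'topic': topics, 'review': reviews})

# Variant 1: agg with a per-column function; as_index=False keeps 'topic' as a column
by_agg = df.groupby('topic', as_index=False).agg({'review': ', '.join})

# Variant 2: select the column, apply the join, then turn the index back into a column
by_apply = df.groupby('topic')['review'].apply(', '.join).reset_index()

# Same idea with a space separator instead of a comma
by_space = df.groupby('topic', as_index=False).agg({'review': ' '.join})

# Widen the display so long strings are not truncated with "..."
with pd.option_context('display.max_colwidth', None):
    print(by_agg)
```

Both variants rely on ', '.join being an ordinary callable that accepts any iterable of strings, so each group's review values are passed to it directly. By default groupby sorts the group keys, and rows within a group keep their original order.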