I have a CSV file that contains rows sharing the same id, and I want to merge their messages. I found a nice way to do this with a pandas DataFrame, and I took the code for it from this post.
Sample CSV file:
id messages
0 11 I am not driving home
1 11 Please pick me up
2 11 I don't have money
3 103 The car already park
4 103 No need for ticket
5 104 I will buy a car
6 104 I will buy a car
The desired output is:
id messages
 11 I am not driving home Please pick me up I don't have money
103 The car already park No need for ticket
104 I will buy a car
The code I have so far is:
aggregation_functions = {'messages': 'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
With this code I currently get:
id messages
 11 I am not driving homePlease pick me upI don't have money
103 The car already parkNo need for ticket
104 I will buy a car
I just want spaces between the words (e.g. "homePlease" > "home Please") and to avoid duplicates such as "I will buy a car" appearing twice.
I have already checked post 2, but could not find the answer there.
I also need to use .reindex(columns=df.columns) after aggregate(aggregation_functions), like this:
df_new = df.groupby(df['id']).aggregate(aggregation_functions).reindex(columns=df.columns)
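For reference, here is a minimal sketch that rebuilds the sample data as a DataFrame, so the snippets in the answers below can be run as-is (an assumption: in the real setup df would come from pandas.read_csv; the column names id and messages are taken from the example above):

import pandas as pd

# Rebuild the question's sample data (an assumption; normally this
# would be loaded with pd.read_csv).
df = pd.DataFrame({
    'id': [11, 11, 11, 103, 103, 104, 104],
    'messages': ["I am not driving home",
                 "Please pick me up",
                 "I don't have money",
                 "The car already park",
                 "No need for ticket",
                 "I will buy a car",
                 "I will buy a car"],
})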
Answer 0 (score: 2):
You are better off combining apply with join:
>>> df
id messages
0 11 I am not driving home
1 11 Please pick me up
2 11 I don't have money
3 103 The car already park
4 103 No need for ticket
5 104 I will buy a car
6 104 I will buy a car
>>> df.groupby('id')['messages'].apply(lambda x: ' '.join(x))
id
11 I am not driving home Please pick me up I don'...
103 The car already park No need for ticket
104 I will buy a car I will buy a car
Name: messages, dtype: object
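Since str.join is itself a callable, the lambda can be dropped; this one-liner is equivalent:

df.groupby('id')['messages'].apply(' '.join)

Note that this only inserts the spaces; the duplicated "I will buy a car" for id 104 is still there (see the last row above), which the answers below address.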
Answer 1 (score: 2):
To remove the redundancy, I suggest GroupBy.unique followed by str.join:
df.groupby('id')['messages'].unique().str.join(' ')
Alternatively, use GroupBy.agg with set + ' '.join:
df.groupby('id')['messages'].agg(lambda x: ' '.join(set(x)))
Both print a deduplicated Series; note that set does not preserve order (which is why the id 11 messages are reordered below), whereas unique keeps first-appearance order:
# id
# 11 I don't have money I am not driving home Pleas...
# 103 No need for ticket The car already park
# 104 I will buy a car
# Name: messages, dtype: object
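If the reordering caused by set is undesirable, one order-preserving sketch is to deduplicate with dict.fromkeys (which keeps first-appearance order) inside agg:

df.groupby('id')['messages'].agg(lambda x: ' '.join(dict.fromkeys(x)))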
To get a DataFrame back, call reset_index at the end...
df.groupby('id')['messages'].unique().str.join(' ').reset_index()
# id messages
# 0 11 I am not driving home Please pick me up I don'...
# 1 103 The car already park No need for ticket
# 2 104 I will buy a car
Answer 2 (score: 2):
So: drop_duplicates first, then agg with ' '.join:
df.drop_duplicates().groupby('id',as_index=False).messages.agg(' '.join)
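With the sample data above this should print roughly:

#     id                                           messages
# 0   11  I am not driving home Please pick me up I don'...
# 1  103           The car already park No need for ticket
# 2  104                                   I will buy a car

Here drop_duplicates compares whole rows, so identical messages under different ids would survive, and as_index=False keeps id as a regular column, so no trailing reset_index is needed.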