My data frame contains the following data:
callerid seq text
1236 2 I need to talk to x
1236 6 Issue 3 is this
1236 3 This is regarding abc
1236 5 Issue 2 is this
1236 4 Issue 1 is this
1236 1 Hi
1347 2 I need to talk to x
1347 6 Issue 3 is this
1347 3 This is regarding abc
1347 5 Issue 2 is this
1347 4 Issue 1 is this
1347 1 Hi
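I need to group the data by callerid, sort it by seq, concatenate the text, and write the result to another data frame.
The final output should look like this: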
callerid text
1236 Hi I need to talk to X This is regarding abc Issue 1 is this Issue 2 is this Issue 3 is this
1347 Hi I need to talk to X This is regarding abc Issue 1 is this Issue 2 is this Issue 3 is this
I tried the following code:
documenttext = dataextract.sort_values(['callerid','seq']).groupby('callerid')
documenttext1 = documenttext[['callerid','text']]
documenttext1 = (documenttext1.groupby('callerid')['text']
                 .apply(lambda x: ' '.join(set(x.dropna())))
                 .reset_index())
The first statement does not give me the fully sorted text. This is the output I get:
callerid seq text
1236 1 Hi
1236 2 I need to talk to x
1236 3 This is regarding abc
1347 1 Hi
1347 2 I need to talk to x
1347 3 This is regarding abc
Any help is appreciated. Thanks in advance.
Answer 0 (score: 2)
You guessed right: the first step is to sort and the second is to group. You can pass ' '.join
as the aggfunc to concatenate the strings.
(df.sort_values('seq')
.groupby('callerid', sort=False)['text']
.agg(' '.join)
.reset_index())
callerid text
0 1236 Hi I need to talk to x This is regarding abc I...
1 1347 Hi I need to talk to x This is regarding abc I...
You should not group on "seq", because you are trying to aggregate over all the rows of each callerid.
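As a self-contained sketch of the approach above, the snippet below rebuilds the sample data from the question and runs the sort-then-group pipeline. The frame name `df` is an assumption, and the `dropna` step is carried over from the question's attempt in case the text column can contain missing values. The question's attempt wrapped the values in `set(...)`; a set is unordered and drops duplicates, so it would undo the sort, which is why it is left out here.

```python
import pandas as pd

# Sample data copied from the question; `df` is an assumed name.
df = pd.DataFrame({
    "callerid": [1236] * 6 + [1347] * 6,
    "seq": [2, 6, 3, 5, 4, 1] * 2,
    "text": [
        "I need to talk to x",
        "Issue 3 is this",
        "This is regarding abc",
        "Issue 2 is this",
        "Issue 1 is this",
        "Hi",
    ] * 2,
})

result = (
    df.dropna(subset=["text"])            # skip missing text (assumption: NaNs may occur)
      .sort_values(["callerid", "seq"])   # order the rows within each caller
      .groupby("callerid", sort=False)["text"]
      .agg(" ".join)                      # join in sorted order; no set(), which loses order
      .reset_index()
)
print(result.to_string(index=False))
```

Sorting on both `callerid` and `seq` is optional here; sorting on `seq` alone gives the same text per caller because `groupby` keeps the row order within each group.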
Answer 1 (score: 1)
More like an index-based sum:
# sum(level=...) was removed in pandas 2.0, so group on the index level instead
((' ' + df.set_index(['callerid', 'seq']).sort_index(level=[0, 1]).text)
 .groupby(level=0).sum()
 .str.strip()
 .reset_index())
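For comparison, here is a minimal sketch of the same index-based idea written with `groupby(level=...)` and `' '.join`. This form may be needed on newer pandas releases, where the `level` argument of `sum` is no longer available (it was removed in pandas 2.0). The frame `df` below is an assumed, shortened version of the question's data.

```python
import pandas as pd

# Shortened sample data (assumption: same columns as in the question).
df = pd.DataFrame({
    "callerid": [1236, 1236, 1236, 1347, 1347, 1347],
    "seq": [2, 3, 1, 3, 1, 2],
    "text": [
        "I need to talk to x", "This is regarding abc", "Hi",
        "This is regarding abc", "Hi", "I need to talk to x",
    ],
})

# Move callerid and seq into a MultiIndex and sort on both levels.
indexed = df.set_index(["callerid", "seq"]).sort_index()

out = (
    indexed["text"]
    .groupby(level="callerid")   # group on the outer index level
    .agg(" ".join)               # concatenate in seq order
    .reset_index()
)
print(out.to_string(index=False))
```

Joining with `' '.join` also avoids the leading-space-then-`strip` step that the string `sum` relies on.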