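I'm struggling to work out how to iterate row by row in pandas.
I have a dataset containing chat conversations between two parties. I want to combine the records into a turn-by-turn conversation between Person1 and Person2. Sometimes a person types several sentences in a row, and these appear as multiple records in the data frame.
This is the loop I keep going back and forth on:
Because the dataset contains multiple ids, each identifying one conversation between Person1 and Person2, I want the loop to run once per unique id.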
id timestamp line_by line_text
1234 02:54.3 Person1 Text Line 1
1234 03:23.8 Person2 Text Line 2
1234 03:47.0 Person2 Text Line 3
1234 04:46.8 Person1 Text Line 4
1234 05:46.2 Person1 Text Line 5
9876 06:44.5 Person2 Text Line 6
9876 07:27.6 Person1 Text Line 7
9876 08:17.5 Person2 Text Line 8
9876 10:20.3 Person2 Text Line 9
I would like to transform the data into the following:
id timestamp line_by line_text
1234 02:54.3 Person1 Text Line 1
1234 03:47.0 Person2 Text Line 2Text Line 3
1234 05:46.2 Person1 Text Line 4Text Line 5
9876 06:44.5 Person2 Text Line 6
9876 07:27.6 Person1 Text Line 7
9876 10:20.3 Person2 Text Line 8Text Line 9
Any ideas are appreciated.
Answer 0 (score: 2)
You can groupby on runs of consecutive line_by values, then use agg to keep the last timestamp of each run and ''.join its line_text.
In [1918]: (df.groupby((df.line_by != df.line_by.shift()).cumsum(), as_index=False)
.agg({'id': 'first', 'timestamp': 'last', 'line_by': 'first',
'line_text': ''.join}))
Out[1918]:
timestamp line_text id line_by
0 02:54.3 Text Line 1 1234 Person1
1 03:47.0 Text Line 2Text Line 3 1234 Person2
2 05:46.2 Text Line 4Text Line 5 1234 Person1
3 06:44.5 Text Line 6 9876 Person2
4 07:27.6 Text Line 7 9876 Person1
5 10:20.3 Text Line 8Text Line 9 9876 Person2
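One thing to watch: the grouping key above only looks at line_by, so if one conversation ended and the next began with the same speaker, those two turns would be merged across ids. The following is a minimal sketch, assuming each id should stay separate as the question asks. It starts a new block whenever either line_by or id changes. The DataFrame is rebuilt from the sample data in the question, and the names new_block, block and result are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": [1234, 1234, 1234, 1234, 1234, 9876, 9876, 9876, 9876],
    "timestamp": ["02:54.3", "03:23.8", "03:47.0", "04:46.8", "05:46.2",
                  "06:44.5", "07:27.6", "08:17.5", "10:20.3"],
    "line_by": ["Person1", "Person2", "Person2", "Person1", "Person1",
                "Person2", "Person1", "Person2", "Person2"],
    "line_text": [f"Text Line {i}" for i in range(1, 10)],
})

# A new block starts whenever the speaker OR the conversation id
# differs from the previous row.
new_block = (df["line_by"] != df["line_by"].shift()) | (df["id"] != df["id"].shift())
block = new_block.cumsum().rename("block")

# Same aggregation as in the answer, applied per block.
result = (
    df.groupby(block)
      .agg({"id": "first", "timestamp": "last",
            "line_by": "first", "line_text": "".join})
      .reset_index(drop=True)
)
print(result)
```

On the sample data this gives the same six rows as above, because the speaker happens to change at the boundary between the two ids.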
Details
In [1919]: (df.line_by != df.line_by.shift()).cumsum()
Out[1919]:
0 1
1 2
2 2
3 3
4 3
5 4
6 5
7 6
8 6
Name: line_by, dtype: int32
In [1920]: df
Out[1920]:
id timestamp line_by line_text
0 1234 02:54.3 Person1 Text Line 1
1 1234 03:23.8 Person2 Text Line 2
2 1234 03:47.0 Person2 Text Line 3
3 1234 04:46.8 Person1 Text Line 4
4 1234 05:46.2 Person1 Text Line 5
5 9876 06:44.5 Person2 Text Line 6
6 9876 07:27.6 Person1 Text Line 7
7 9876 08:17.5 Person2 Text Line 8
8 9876 10:20.3 Person2 Text Line 9
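To see why the grouping key works, it can help to break the expression into its parts. This is an illustrative sketch on the line_by values from the sample; the variable names previous, changed and run_id are hypothetical and do not appear in the answer.

```python
import pandas as pd

line_by = pd.Series(
    ["Person1", "Person2", "Person2", "Person1", "Person1",
     "Person2", "Person1", "Person2", "Person2"],
    name="line_by",
)

previous = line_by.shift()      # value from the row above (NaN for the first row)
changed = line_by != previous   # True where the speaker differs from the row above
run_id = changed.cumsum()       # running count of changes: one label per consecutive run

# Show the intermediate steps side by side.
print(pd.DataFrame({"line_by": line_by, "previous": previous,
                    "changed": changed, "run_id": run_id}))
```

Rows that share a run_id belong to the same uninterrupted turn, which is why grouping on it collapses consecutive lines by the same person.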