I have extracted a PDF into a DataFrame, and I want to merge consecutive rows of Column B when the speaker in Column C is the same:
From:
Index Column B Column C
1 'I am going' Speaker A
2 'to the zoo' Speaker A
3 'I am going' Speaker B
4 'home ' Speaker B
5 'I am going' Speaker A
6 'to the park' Speaker A
To:
Index Column B Column C
1 'I am going to the zoo ' Speaker A
2 'I am going home' Speaker B
3 'I am going to the park' Speaker A
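I tried using groupby, but the row order matters in the context of the PDF (it is a transcript of speech).
Answer 0 (score: 2)
After creating a series that identifies where Column C changes, you can use GroupBy + agg: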
res = df.assign(key=df['Column C'].ne(df['Column C'].shift()).cumsum())\
.groupby('key').agg({'Column C': 'first', 'Column B': ' '.join})\
.reset_index()
print(res)
key Column C Column B
0 1 Speaker A 'I am going' 'to the zoo'
1 2 Speaker B 'I am going' 'home '
2 3 Speaker A 'I am going' 'to the park'
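Note that the output contains quotation marks because the input you provided does. They will not appear if the strings are defined without embedded quotes.
Answer 1 (score: 0)
Use groupby and agg, as follows: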
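For reference, here is a minimal self-contained sketch of the same idea, assuming the strings are stored without embedded quotes. The sample DataFrame is rebuilt from the question; the series name `key` and the final `str.strip()` call (to drop the trailing space in 'home ') are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data mirroring the question, with plain (unquoted) strings.
df = pd.DataFrame({
    'Column B': ['I am going', 'to the zoo', 'I am going', 'home ',
                 'I am going', 'to the park'],
    'Column C': ['Speaker A', 'Speaker A', 'Speaker B', 'Speaker B',
                 'Speaker A', 'Speaker A'],
})

# Flag rows where the speaker differs from the previous row, then take the
# cumulative sum so every run of consecutive rows shares one id.
key = df['Column C'].ne(df['Column C'].shift()).cumsum().rename('key')

# Join the text within each run and keep the run's speaker.
res = (
    df.groupby(key)
      .agg({'Column B': ' '.join, 'Column C': 'first'})
      .reset_index(drop=True)
)

# Remove stray leading/trailing whitespace left over from the source rows.
res['Column B'] = res['Column B'].str.strip()
print(res)
```

Because the group key increases with row position, the merged rows come out in the same order as the original transcript.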
import pandas as pd
from functools import reduce
data = {'col1': [1,1,2,2,3], 'col2': ['foo', 'bar', 'baz', 'bag', 'bat']}
df = pd.DataFrame(data)
print(df)
aggregated = df.groupby('col1').agg(lambda x: reduce(lambda s1, s2: s1 + s2, x))
print(aggregated)
This produces the following output:
col1 col2
0 1 foo
1 1 bar
2 2 baz
3 2 bag
4 3 bat
col2
col1
1 foobar
2 bazbag
3 bat
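One caveat worth illustrating: grouping on the column itself merges every row that shares a value, wherever it appears, and the sample data above only works because each value is contiguous. The sketch below uses hypothetical data in which the value 1 reappears later, and uses ''.join in place of reduce (it concatenates the strings in the same way). The names `by_value`, `run_id`, and `by_run` are illustrative.

```python
import pandas as pd

# Hypothetical data: the value 1 reappears after the run of 2s.
df = pd.DataFrame({'col1': [1, 1, 2, 2, 1],
                   'col2': ['foo', 'bar', 'baz', 'bag', 'bat']})

# Grouping on the column itself merges all rows with the same value,
# so the last row is folded into the first group.
by_value = df.groupby('col1')['col2'].agg(''.join)
print(by_value.tolist())  # ['foobarbat', 'bazbag']

# Grouping on a run id keeps separate runs apart and preserves row order.
run_id = df['col1'].ne(df['col1'].shift()).cumsum().rename('run')
by_run = df.groupby(run_id)['col2'].agg(''.join)
print(by_run.tolist())  # ['foobar', 'bazbag', 'bat']
```

If the order of appearance matters, as it does for a transcript, the run-id grouping is the variant that matches the requested output.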