I have a DataFrame that looks like this:
word start stop speaker
0 but, 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay, 9.19 10.01 2
7 sure. 10.02 11.01 2
8 what? 11.02 12.00 1
9 i 12.01 13.00 2
10 agree, 13.01 14.00 2
11 but 14.01 15.00 2
12 i 15.01 16.00 2
13 disagree 16.01 17.00 2
14 that's 17.01 18.00 1
15 fine, 18.01 19.00 1
16 however 19.01 20.00 1
17 you 20.01 21.00 1
18 are 21.01 22.00 1
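I want to merge the words in the "word" column into groups, closing a group whenever the speaker changes or a punctuation mark occurs (apostrophes do not count). Besides joining the words, each group should take its "start" from its first word and its "stop" from its last word. This is the result I am after: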
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 I agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
Any suggestions for accomplishing this would be greatly appreciated.
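For anyone who wants to reproduce the problem, the sample frame can be rebuilt roughly as follows. This is only a sketch: it assumes the columns are whitespace-separated and that the leading integers in the listing are the default index, so they are left out of the raw text. The names `raw` and `df` are chosen here for illustration.

```python
import io

import pandas as pd

# Sample data copied from the question; the integer index is omitted
# because pandas generates a default RangeIndex on its own.
raw = """word start stop speaker
but, 2.72 2.85 2
that's 2.85 3.09 2
alright 3.09 3.47 2
we'll 8.43 8.69 1
have 8.69 8.97 1
to 8.97 9.07 1
okay, 9.19 10.01 2
sure. 10.02 11.01 2
what? 11.02 12.00 1
i 12.01 13.00 2
agree, 13.01 14.00 2
but 14.01 15.00 2
i 15.01 16.00 2
disagree 16.01 17.00 2
that's 17.01 18.00 1
fine, 18.01 19.00 1
however 19.01 20.00 1
you 20.01 21.00 1
are 21.01 22.00 1
"""

# Split on runs of whitespace; the apostrophes are safe because the
# default quote character of read_csv is the double quote.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.shape)
```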
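Answer 0 (score: 1)
You can check whether the last character of each word is in a list of punctuation marks, flag speaker changes as well, and group by a reversed cumulative sum of those flags: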
punctuation = list(',.?!')
s = (df['word'].str.strip().str[-1].isin(punctuation) # punctuation
| df['speaker'].ne(df['speaker'].shift(-1)) # speaker change
)
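# count the flags from each row to the end, so all rows of one group share a value
s = s.iloc[::-1].cumsum().iloc[::-1]
# flip the numbering so that the group labels count up from 0
s = s.max()-s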
df.groupby(s).agg({'word':' '.join, 'start':'min', 'stop':'max', 'speaker': 'min'})
Output:
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 i agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
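To see why the reversed cumulative sum works, it may help to print the intermediate Series on a smaller frame. The sketch below uses only the first seven rows of the question's data; the names `ends_group`, `remaining` and `group_id` are introduced here for readability and are not part of the answer above.

```python
import pandas as pd

# First seven rows of the question's data.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01],
    "speaker": [2, 2, 2, 1, 1, 1, 2],
})

# True on the LAST word of each group: the word ends in punctuation,
# or the next row belongs to a different speaker (or there is no next row).
ends_group = (
    df["word"].str.strip().str[-1].isin(list(",.?!"))
    | df["speaker"].ne(df["speaker"].shift(-1))
)
print(ends_group.tolist())
# [True, False, True, False, False, True, True]

# Number of group endings from each row to the bottom of the frame.
# A plain forward cumsum would separate the closing word from its group.
remaining = ends_group.iloc[::-1].cumsum().iloc[::-1]
print(remaining.tolist())
# [4, 3, 3, 2, 2, 2, 1]

# Flip the labels so that the first group is 0.
group_id = remaining.max() - remaining
print(group_id.tolist())
# [0, 1, 1, 2, 2, 2, 3]

out = df.groupby(group_id).agg(
    {"word": " ".join, "start": "min", "stop": "max", "speaker": "min"}
)
print(out)
```

Using 'min' and 'max' for start and stop gives the first and last timestamps only as long as the rows are sorted by time, which is the case in the sample data.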
Answer 1 (score: 0)
Try str.extract, then group on the custom masks s1 and s2 and aggregate with agg:
s = df.word.str.extract(r"([^\w\s'])", expand=False).notna()
s1 = s.cumsum() - s
s2 = df.speaker.diff().ne(0).cumsum()
(df.groupby([s1, s2], sort=False, as_index=False)
.agg({'word': ' '.join, 'start': 'first', 'stop': 'last', 'speaker': 'first'}))
Out[70]:
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 i agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
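The two keys in this approach can be inspected in the same way on a smaller frame. The sketch below is a variation rather than the answer's exact code: it renames the key Series (`punct_block` and `speaker_run` are names made up here) so they cannot clash with the existing "word" and "speaker" columns, casts the boolean mask to int before subtracting, and resets the index instead of passing as_index=False.

```python
import pandas as pd

# First seven rows of the question's data.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01],
    "speaker": [2, 2, 2, 1, 1, 1, 2],
})

# True where the word contains a character that is not a word character,
# whitespace or an apostrophe. The match may sit anywhere in the word.
has_punct = df["word"].str.extract(r"([^\w\s'])", expand=False).notna()

# Subtracting the flag keeps a punctuated word in the block it closes.
s1 = (has_punct.cumsum() - has_punct.astype(int)).rename("punct_block")
print(s1.tolist())
# [0, 1, 1, 1, 1, 1, 1]

# A new run starts whenever the speaker differs from the previous row.
s2 = df["speaker"].diff().ne(0).cumsum().rename("speaker_run")
print(s2.tolist())
# [1, 1, 1, 2, 2, 2, 3]

# Rows belong together only if they agree on both keys.
out = (
    df.groupby([s1, s2], sort=False)
    .agg({"word": " ".join, "start": "first", "stop": "last", "speaker": "first"})
    .reset_index(drop=True)
)
print(out)
```

Because the regular expression is not anchored to the end of the word, a word with punctuation in the middle would also close a block, which differs slightly from checking only the last character.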