Question

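I have a dataframe that is a transcript of a 2-person conversation. The df contains the words, their timestamps, and the speaker's label. It looks like this.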

        word  start   stop  speaker
0        but   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6       okay   9.19  10.01        2
7       sure  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10     agree  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14     thats  17.01  18.00        1
15      fine  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1
19      like  22.01  23.00        1
20      this  23.01  24.00        1
21       and  24.01  25.00        1
21       and 24.01 25.00        1

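I have code that combines each speaker's consecutive words into a single utterance while preserving the timestamps and the speaker label. Using this code: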

df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})

I get:

                 word  start   stop  speaker
0  but that's alright   2.72   3.47        2
1       we'll have to   8.43   9.07        1
2           okay sure   9.19  11.01        2
3               what?  11.02  12.00        1

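However, I would like to split these combined utterances into sub-utterances based on the presence of connective words ("however", "and", "but", etc.). As a result, I want this: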

                 word  start   stop  speaker
0  but that's alright   2.72   3.47        2
1       we'll have to   8.43   9.07        1
2           okay sure   9.19  11.01        2
3               what?  11.02  12.00        1
4             I agree  12.01  14.00        2
5      but i disagree  14.01  17.00        2
6          thats fine  17.01  19.00        1
7     however you are  19.01  22.00        1
8           like this  22.01  24.00        1
9                 and  24.01  25.00        1

Any suggestions for accomplishing this would be greatly appreciated.

Answer 1

You can add an OR (|) to the grouping key so that a new group also starts whenever word is in a specific list (for example, using df['word'].isin(['however', 'and', 'but'])):

df.groupby([((df['speaker'] != df['speaker'].shift()) | (df['word'].isin(['however', 'and', 'but'])) ).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})
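
For anyone who wants to try this end to end, here is a minimal runnable sketch of the same idea on the sample data from the question. A few things in it are my own assumptions rather than part of the answer above: the names CONNECTIVES, new_speaker, is_connective and block are made up for illustration, the key is built as a single named Series and aggregated with named aggregation (which should avoid having two group keys both called speaker), and "like" is added to the word list because the desired output in the question also splits before "like this". Note that isin matches whole values exactly and is case-sensitive, so "But" or "but," would not start a new group unless the words are normalized first.

```python
import pandas as pd

# Sample data copied from the question.
words = ("but that's alright we'll have to okay sure what? i agree "
         "but i disagree thats fine however you are like this and").split()
starts = [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01, 13.01,
          14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01, 22.01, 23.01, 24.01]
stops = [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00, 14.00,
         15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 24.00, 25.00]
speakers = [2] * 3 + [1] * 3 + [2] * 2 + [1] + [2] * 5 + [1] * 8
df = pd.DataFrame({"word": words, "start": starts, "stop": stops, "speaker": speakers})

# Assumed word list: "like" is not in the answer above; it is added here
# only so the result matches the desired output in the question.
CONNECTIVES = ["however", "and", "but", "like"]

# A row opens a new group if the speaker changes or the word is a connective.
new_speaker = df["speaker"] != df["speaker"].shift()
is_connective = df["word"].isin(CONNECTIVES)

# The running sum of the boolean flags gives every group its own integer id.
block = (new_speaker | is_connective).cumsum().rename("block")

result = (
    df.groupby(block)
      .agg(word=("word", " ".join),
           start=("start", "min"),
           stop=("stop", "max"),
           speaker=("speaker", "first"))
      .reset_index(drop=True)
)
print(result)
```

With only ['however', 'and', 'but'] in the list, "however you are like this" stays together as one row, so the list decides how finely the utterances are split.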
