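How can I split text strings on conjunctions in Python?

Asked: 2019-06-24 20:48:03

Tags: python nlp nltk

I have a data frame that holds the transcript of a two-person conversation. Each row of df contains a word, its start and stop timestamps, and the speaker's label. It looks like this: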

        word  start   stop  speaker
0        but   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6       okay   9.19  10.01        2
7       sure  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10     agree  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14     thats  17.01  18.00        1
15      fine  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1
19      like  22.01  23.00        1
20      this  23.01  24.00        1
21       and  24.01  25.00        1
21       and 24.01 25.00        1

I have code that combines each speaker's consecutive words into a single utterance while preserving the timestamps and the speaker label:

df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})

I get:

                  word  start   stop  speaker
0   but that's alright   2.72   3.47        2
1        we'll have to   8.43   9.07        1
2            okay sure   9.19  11.01        2
3                what?  11.02  12.00        1

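However, I would like to split these combined utterances into sub-utterances wherever a conjunction or conjunctive adverb ("however", "and", "but", etc.) appears. The result I want is: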

                  word  start   stop  speaker
0   but that's alright   2.72   3.47        2
1        we'll have to   8.43   9.07        1
2            okay sure   9.19  11.01        2
3                what?  11.02  12.00        1
4              I agree  12.01  14.00        2
5       but i disagree  14.01  17.00        2
6           thats fine  17.01  19.00        1
7      however you are  19.01  22.00        1
8            like this  22.01  24.00        1
9                  and  24.01  25.00        1

Any suggestions for accomplishing this would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

You can add an OR (|) to the grouping condition and check, before grouping, whether word is in a specific list (for example, with df['word'].isin(['however', 'and', 'but'])):

df.groupby([((df['speaker'] != df['speaker'].shift()) | (df['word'].isin(['however', 'and', 'but'])) ).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max'
})  
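
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea applied to the sample data from the question. The names `split_utterances`, `CONJUNCTIONS` and `block` are illustrative and not part of the original code. It also assumes that aggregating `speaker` with `'first'` is an acceptable way to keep the speaker column, which avoids relying on how `as_index=False` treats external grouping keys.

```python
import pandas as pd

# Sample rows copied from the question: (word, start, stop, speaker).
rows = [
    ("but", 2.72, 2.85, 2), ("that's", 2.85, 3.09, 2), ("alright", 3.09, 3.47, 2),
    ("we'll", 8.43, 8.69, 1), ("have", 8.69, 8.97, 1), ("to", 8.97, 9.07, 1),
    ("okay", 9.19, 10.01, 2), ("sure", 10.02, 11.01, 2), ("what?", 11.02, 12.00, 1),
    ("i", 12.01, 13.00, 2), ("agree", 13.01, 14.00, 2), ("but", 14.01, 15.00, 2),
    ("i", 15.01, 16.00, 2), ("disagree", 16.01, 17.00, 2), ("thats", 17.01, 18.00, 1),
    ("fine", 18.01, 19.00, 1), ("however", 19.01, 20.00, 1), ("you", 20.01, 21.00, 1),
    ("are", 21.01, 22.00, 1), ("like", 22.01, 23.00, 1), ("this", 23.01, 24.00, 1),
    ("and", 24.01, 25.00, 1),
]
df = pd.DataFrame(rows, columns=["word", "start", "stop", "speaker"])

# Illustrative list (an assumption): extend it with any word that should
# start a new sub-utterance.
CONJUNCTIONS = ["however", "and", "but"]


def split_utterances(df, conjunctions):
    """Merge consecutive words into utterances, starting a new one
    whenever the speaker changes or the word is a listed conjunction."""
    starts_new = (df["speaker"] != df["speaker"].shift()) | df["word"].isin(conjunctions)
    # The cumulative sum turns the boolean "start" flags into block numbers.
    block = starts_new.cumsum().rename("block")
    return (
        df.groupby(block)
        .agg({"word": " ".join, "start": "min", "stop": "max", "speaker": "first"})
        .reset_index(drop=True)
    )


result = split_utterances(df, CONJUNCTIONS)
print(result)
```

With only these three words in the list, rows 16 to 20 stay together as "however you are like this"; getting the separate "like this" row shown in the question would require adding "like" to the list as well. Note that `isin` matches exact strings, so capitalized or punctuated tokens (for example "But" or "but,") would need to be normalized first.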
