How to split strings in pandas by punctuation

Date: 2019-09-12 00:33:00

Tags: python pandas

I have a dataframe that looks like this:

        word  start   stop  speaker
0       but,   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6      okay,   9.19  10.01        2
7      sure.  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10    agree,  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14    that's  17.01  18.00        1
15     fine,  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1
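
For anyone who wants to reproduce this, a frame like the one above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and the dtypes (strings, floats, integers) are an assumption, since the post does not show them.

```python
import pandas as pd

# Values copied row by row from the table above; dtypes are assumed.
words = ["but,", "that's", "alright", "we'll", "have", "to", "okay,", "sure.",
         "what?", "i", "agree,", "but", "i", "disagree", "that's", "fine,",
         "however", "you", "are"]
starts = [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
          13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01]
stops = [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
         14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00]
speakers = [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

df = pd.DataFrame(
    {"word": words, "start": starts, "stop": stops, "speaker": speakers}
)
print(df)
```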

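I want to group the words in "word" together, starting a new group whenever the speaker changes or a punctuation mark appears (apostrophes excluded). Besides joining the words, each group should take the "start" of its first word and the "stop" of its last word. What I want looks like this: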

              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         I agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1

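Any suggestions on how to accomplish this would be greatly appreciated.

2 answers:

Answer 0: (score: 1)

You can check whether the last character of each word is in a list of punctuation marks, flag the rows where a group ends (punctuation, or the next row has a different speaker), and group by the reversed cumulative sum of those flags: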

punctuation = list(',.?!')

s = (df['word'].str.strip().str[-1].isin(punctuation) # punctuation
     | df['speaker'].ne(df['speaker'].shift(-1))      # speaker change
    )
s = s.iloc[::-1].cumsum().iloc[::-1]

# reverse order of s
s = s.max()-s

df.groupby(s).agg({'word':' '.join, 'start':'min', 'stop':'max', 'speaker': 'min'})

Output:

              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
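
To make the reversed cumulative sum easier to follow, here is a small self-contained sketch that runs the same steps on the first eight rows and prints the intermediate values. The names `ends_group`, `remaining`, `group_id` and `steps` are made up for illustration and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,", "sure."],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2],
})

punctuation = list(",.?!")

# True on the LAST row of each group: the word ends in punctuation, or the
# next row has a different speaker (the final row compares against NaN,
# so it is always flagged).
ends_group = (
    df["word"].str.strip().str[-1].isin(punctuation)
    | df["speaker"].ne(df["speaker"].shift(-1))
)

# Counting the end flags from the bottom up gives every row of a group the
# same number; subtracting from the max turns it into ascending group ids.
remaining = ends_group.iloc[::-1].cumsum().iloc[::-1]
group_id = remaining.max() - remaining

steps = df[["word", "speaker"]].assign(
    ends_group=ends_group, remaining=remaining, group_id=group_id
)
print(steps)

result = df.groupby(group_id).agg(
    {"word": " ".join, "start": "min", "stop": "max", "speaker": "min"}
)
print(result)
```

A plain forward cumulative sum would increment on the flagged row itself, which would push each closing word into the next group; summing from the end keeps it with the words before it.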

Answer 1: (score: 0)

Try str.extract to build a punctuation mask, then groupby on the custom keys s1 and s2 and agg:

s = df.word.str.extract(r"([^\w\s'])", expand=False).notna()
s1 = s.cumsum() - s
s2 = df.speaker.diff().ne(0).cumsum()

(df.groupby([s1, s2], sort=False, as_index=False)
   .agg({'word': ' '.join, 'start': 'first', 'stop': 'last', 'speaker': 'first'}))

Out[70]:
              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
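
As a rough illustration of how the two keys combine, the sketch below applies the same idea to the last ten rows and prints the keys next to the words. The names `has_punct`, `punct_key` and `speaker_key` are hypothetical, and it uses `reset_index(drop=True)` instead of `as_index=False` to discard the group keys, which is a choice made here rather than something taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["i", "agree,", "but", "i", "disagree",
             "that's", "fine,", "however", "you", "are"],
    "start": [12.01, 13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01],
    "stop": [13.00, 14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00],
    "speaker": [2, 2, 2, 2, 2, 1, 1, 1, 1, 1],
})

# True where the word contains a character that is neither a word character,
# whitespace, nor an apostrophe (so "that's" is not flagged).
has_punct = df["word"].str.extract(r"([^\w\s'])", expand=False).notna()

# cumsum minus the flag itself: the punctuated word keeps the id of the
# words before it, and the id only increases on the following row.
punct_key = (has_punct.cumsum() - has_punct.astype(int)).rename("punct_key")

# A new id every time the speaker differs from the previous row.
speaker_key = df["speaker"].diff().ne(0).cumsum().rename("speaker_key")

print(df[["word", "speaker"]].assign(punct_key=punct_key, speaker_key=speaker_key))

result = (
    df.groupby([punct_key, speaker_key], sort=False)
      .agg({"word": " ".join, "start": "first", "stop": "last", "speaker": "first"})
      .reset_index(drop=True)
)
print(result)
```

Grouping on both keys at once means a new group starts as soon as either one changes, which is what splits "but i disagree" from "that's fine," even though no punctuation separates them.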