How to split text in a pandas df column by keywords and punctuation

Time: 2019-07-24 01:03:48

Tags: python pandas

I have a df that looks like this:

        word  start   stop  speaker
0       but,   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6      okay!   9.19  10.01        2
7       sure  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10    agree,  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14     thats  17.01  18.00        1
15      fine  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1
19      like  22.01  23.00        1
20      this  23.01  24.00        1
21       and  24.01  25.00        1

I have some code that splits it on connecting words such as 'however', 'and' and 'but' (and aggregates the other columns along with the split), but I also want to split it wherever punctuation occurs. The code is as follows:

df.groupby([((df['speaker'] != df['speaker'].shift()) | (df['word'].isin(['however', 'and', 'but'])) ).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max',
    'speaker': 'max'
})

I tried adding punctuation marks to the split criteria, but that failed:

df.groupby([((df['speaker'] != df['speaker'].shift()) | (df['word'].isin(['however', 'and', 'but', ',', '.', '?'])) ).cumsum(), df['speaker']], as_index=False).agg({
    'word': ' '.join,
    'start': 'min',
    'stop': 'max',
    'speaker': 'max'
})

What I want in the end is a df that is split at punctuation marks as well as at the keywords.

Please advise.
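A likely reason the attempt with `','`, `'.'` and `'?'` changes nothing: `Series.isin` compares whole cell values, and in this data the punctuation is attached to the end of a word (`'but,'`, `'okay!'`, `'what?'`), so no cell is ever equal to `','` on its own. The following is only a minimal sketch of one possible fix, not a definitive answer. It assumes that a keyword opens a new segment, that a word ending in punctuation closes the current segment (so the next row starts a new one), and that `!` should count as punctuation even though it is not in the list above. The names `KEYWORDS`, `PUNCT` and `group_id` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay!",
             "sure", "what?", "i", "agree,", "but", "i", "disagree",
             "thats", "fine", "however", "you", "are", "like", "this", "and"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
              13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01,
              22.01, 23.01, 24.01],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
             14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00,
             23.00, 24.00, 25.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1],
})

KEYWORDS = ["however", "and", "but"]  # hypothetical name, values from the question
PUNCT = ",.?!"                        # assumption: '!' is treated as punctuation too

words = df["word"].str.strip()
bare = words.str.rstrip(PUNCT)        # 'but,' -> 'but', so it still matches a keyword

# A new segment starts when the speaker changes ...
speaker_change = df["speaker"] != df["speaker"].shift()
# ... or when the row is a keyword ...
is_keyword = bare.isin(KEYWORDS)
# ... or when the PREVIOUS row ended with a punctuation mark.
after_punct = words.str[-1].isin(list(PUNCT)).shift(fill_value=False)

group_id = (speaker_change | is_keyword | after_punct).cumsum()

result = (
    df.groupby(group_id)
      .agg({"word": " ".join, "start": "min", "stop": "max", "speaker": "first"})
      .reset_index(drop=True)
)
print(result)
```

Under these assumptions the sample data collapses into 11 rows, for example `i agree,` (12.01 to 14.00) followed by `but i disagree` (14.01 to 17.00). If punctuation should instead open a segment, or be removed from the joined text, the `after_punct` line and the aggregated column would need to change accordingly.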

0 answers:

No answers