How to split long strings in a pandas column by punctuation

Time: 2020-04-20 20:24:40

Tags: python pandas nlp

I have a df that looks like this:

words                                           col_a  col_b
I guess, because I have thought over that. Um,      1      0
That? yeah.                                         1      1
I don't always think you're up to something.        0      1

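I want to split the strings in the words column into separate rows wherever one of the punctuation marks (.,?!:;) occurs, while keeping the original row's col_a and col_b values for each new row. For example, the first row above should become three rows ("I guess,", "because I have thought over that." and "Um,"), each with col_a = 1 and col_b = 0.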

1 answer:

Answer 0: (score: 5)

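One approach is to use str.findall with the pattern (.*?[.,?!:;]), which non-greedily matches each of these punctuation marks together with the characters leading up to it, and then explode the resulting lists: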

(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
   .explode('words')
   .reset_index(drop=True))

                                          words  col_a  col_b
0                                      I guess,      1      0
1             because I have thought over that.      1      0
2                                           Um,      1      0
3                                         That?      1      1
4                                         yeah.      1      1
5  I don't always think you're up to something.      0      1
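
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample in the question, and the final str.strip() step is an addition that is not part of the answer above: every match after the first in a row begins with the space that followed the previous punctuation mark, which the right-aligned printout hides.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

out = (
    # Replace each string with the list of its punctuation-terminated chunks.
    df.assign(words=df["words"].str.findall(r".*?[.,?!:;]"))
      # One row per chunk; col_a and col_b are repeated for each new row.
      .explode("words")
      .reset_index(drop=True)
)

# Added step (not in the answer): drop the leading space left on later chunks.
out["words"] = out["words"].str.strip()

print(out)
```

One thing to watch for: because the pattern requires a closing punctuation mark, any text after the last punctuation mark in a string is not matched and is silently dropped. If that can happen in your data, a pattern such as r".*?(?:[.,?!:;]|$)" might be a starting point, though it can also produce empty matches that would need filtering.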