I have a df that looks like this:
words                                           col_a  col_b
I guess, because I have thought over that. Um,  1      0
That? yeah.                                     1      1
I don't always think you're up to something.    0      1
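I want to split the words column into separate rows wherever one of the punctuation marks (.,?!:;) occurs. But for each new row I want to keep the col_a and col_b values of the original row.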
Answer 0 (score: 5)
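One approach is to use str.findall with the pattern (.*?[.,?!:;]), which non-greedily matches each punctuation mark together with the characters that precede it, and then explode the resulting lists: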
(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
   .explode('words')
   .reset_index(drop=True))
   words                                         col_a  col_b
0  I guess,                                      1      0
1  because I have thought over that.             1      0
2  Um,                                           1      0
3  That?                                         1      1
4  yeah.                                         1      1
5  I don't always think you're up to something.  0      1
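
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample in the question, and the final str.strip step is my own addition (an assumption, not part of the snippet above): each match after the first starts with the space that followed the previous punctuation mark, so stripping makes the values compare cleanly. It needs pandas 0.25 or later for explode.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

# Non-greedy: as few characters as possible, up to and including
# the next punctuation mark from the set . , ? ! : ;
pattern = r"(.*?[.,?!:;])"

out = (
    df.assign(words=df["words"].str.findall(pattern))  # one list of pieces per row
      .explode("words")                                # one row per piece; col_a/col_b repeat
      .reset_index(drop=True)
)

# Assumption: leading whitespace on the pieces is unwanted.
out["words"] = out["words"].str.strip()

print(out)
```

One thing to be aware of: with this pattern, any text after the last punctuation mark in a cell is not matched, so it would be dropped. The sample data always ends in punctuation, so nothing is lost here.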