Question

I have a Pandas column containing strings like this:

(15:38) Hello, how are you? (15:39) I am fine. (15:40) That's good.

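I want to split the string on the time markers, so I am using the regular expression r'\(\d{1,2}:\d{1,2}\)'. I only want to keep everything from the third time marker to the end, so the desired output looks like: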

(15:40) That's good.

If there are fewer than three time markers, set the row to empty.

Answer 1

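You can use the pattern (?:(?:\(\d+:\d+\))[^\(]+){2,}(\(\d+:\d+\).*$) together with extract to pull out the last match of your pattern.

This will not work if any of the dialogue itself contains parentheses.

Sample DataFrame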

                                                text
0  (15:38) Hello, how are you? (15:39) I am fine....

extract

df.text.str.extract(r'(?:(?:\(\d+:\d+\))[^\(]+){2,}(\(\d+:\d+\).*$)', expand=False)

0    (15:40) That's good.
Name: text, dtype: object
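One detail worth checking against the question: because the repeated group is quantified with {2,} and is greedy, the capture starts at the last time marker, which equals the third marker only when there are exactly three. The sketch below compares {2,} with {2} on a hypothetical four-marker string (the sample text is an assumption, not data from the post); if everything from the third marker onward is wanted, {2} appears to be the closer fit.

```python
import pandas as pd

# Hypothetical four-marker string, not taken from the original post.
s = pd.Series(["(1:00) a (1:01) b (1:02) c (1:03) d"])

# {2,} is greedy: it consumes as many leading segments as it can,
# so the capture group ends up starting at the LAST marker.
last_only = s.str.extract(
    r'(?:\(\d+:\d+\)[^(]+){2,}(\(\d+:\d+\).*$)', expand=False
)

# {2} consumes exactly two segments, so the capture group
# starts at the THIRD marker and runs to the end of the string.
from_third = s.str.extract(
    r'(?:\(\d+:\d+\)[^(]+){2}(\(\d+:\d+\).*$)', expand=False
)

print(last_only[0])   # (1:03) d
print(from_third[0])  # (1:02) c (1:03) d
```

For strings with exactly three markers, as in the sample DataFrame, both patterns give the same result.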

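As written, rows with fewer than three separate pieces of dialogue are filled with NaN, but you can use fillna to replace those with an empty string if you prefer.

fillna example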

                                                text
0  (5:40) Hello there (3:20) Goodbye (3:30) This ...
1                     (3:30) Test 2 (5:45) Last text
2                              (4:30) Foo (5:18) Bar

df.text.str.extract(r'(?:(?:\(\d+:\d+\))[^\(]+){2,}(\(\d+:\d+\).*$)', expand=False).fillna('')

0    (3:30) This has 3
1
2
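If the dialogue may contain parentheses, one possible workaround is to locate the time markers directly and slice the string from the third one. This is only a sketch under assumptions: the helper name from_third_marker and the sample rows are hypothetical, and it uses the question's \(\d{1,2}:\d{1,2}\) marker pattern with Python's re module instead of str.extract.

```python
import re
import pandas as pd

# Marker pattern from the question: "(H:MM)" or "(HH:MM)".
TIME_MARKER = re.compile(r'\(\d{1,2}:\d{1,2}\)')

def from_third_marker(text):
    """Return text from the third time marker onward, or '' if there are fewer than three."""
    starts = [m.start() for m in TIME_MARKER.finditer(text)]
    return text[starts[2]:] if len(starts) >= 3 else ''

# Hypothetical rows; the first one has parentheses inside the dialogue.
df = pd.DataFrame({"text": [
    "(5:40) Hello (there) (3:20) Goodbye (3:30) This has 3",
    "(3:30) Test 2 (5:45) Last text",
    "(4:30) Foo (5:18) Bar",
]})

result = df["text"].apply(from_third_marker)
print(result.tolist())  # ['(3:30) This has 3', '', '']
```

Because only text shaped like a time marker is counted, "(there)" in the first row is ignored, and rows with fewer than three markers come back as empty strings without a separate fillna step.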
