Question

How can I match the following with a pandas extractall regular expression?

stringwithinmycolumn
stuff, Duration: 15h:22m:33s, notstuff,
stuff, Duration: 18h:22m:33s, notstuff,

Currently, I am using the following:

df.message.str.extractall(r',([^,]*?): ([^,:]*?,').reset_index()

Expected output:

              0              1
match    
    0  Duration    15h:22m:33s
    1  Duration    18h:22m:33s

So far I have not been able to get a match.

Answer 1

You can use

,\s*([^,:]+):\s*([^,]+),

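See the regex demo.

It matches:

, - a comma
\s* - 0+ whitespaces
([^,:]+) - Group 1: one or more characters other than , and :
: - a colon
\s* - 0+ whitespaces
([^,]+) - Group 2: one or more characters other than ,
, - a comma (it can actually be removed, but keeping it makes the match safer)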
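
As a minimal sketch of how this pattern behaves with extractall, assuming a DataFrame named df with a message column as in the question's code (the two sample rows are copied from the question):

```python
import pandas as pd

# Sample rows copied from the question; the column name `message`
# follows the asker's own code.
df = pd.DataFrame({
    "message": [
        "stuff, Duration: 15h:22m:33s, notstuff,",
        "stuff, Duration: 18h:22m:33s, notstuff,",
    ]
})

pattern = r",\s*([^,:]+):\s*([^,]+),"

# extractall returns one row per match, indexed by
# (original row label, match number within that row).
result = df["message"].str.extractall(pattern)
print(result)
```

Because the groups are unnamed, the resulting columns are labeled 0 and 1, with the key in column 0 and the value in column 1.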

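Note that when you need to extract structured information from a long string, you may want to make the regex more precise. Here, you could match Duration with a letters-only pattern and allow only digits, colons, h, m, or s in the time value, so the pattern becomes a bit more verbose: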

,\s*([A-Za-z]+):\s*([\d:hms]+)

but safer. See another regex demo.
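
To illustrate the trade-off, here is a small sketch comparing the two patterns. The third row ("Label: x") is a hypothetical example that is not in the question, added only to show a key/value pair that is not a duration:

```python
import pandas as pd

messages = pd.Series([
    "stuff, Duration: 15h:22m:33s, notstuff,",
    "stuff, Duration: 18h:22m:33s, notstuff,",
    "stuff, Label: x, notstuff,",  # hypothetical non-duration row
])

loose = r",\s*([^,:]+):\s*([^,]+),"
strict = r",\s*([A-Za-z]+):\s*([\d:hms]+)"

loose_result = messages.str.extractall(loose)
strict_result = messages.str.extractall(strict)

# Level 0 of the index holds the original row labels that produced a match.
print(loose_result.index.get_level_values(0).tolist())
print(strict_result.index.get_level_values(0).tolist())
```

In this sketch the looser pattern also picks up the hypothetical Label row, while the stricter one skips it because x is not in the [\d:hms] character class. The stricter class is still only a rough filter: any value that starts with a digit, a colon, h, m, or s would be accepted.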

Answer 2

In [246]: x.message.str.extractall(r',\s*(\w+):\s*([^,]*)').reset_index(level=0, drop=True)
Out[246]:
              0            1
match
0      Duration  15h:22m:33s
0      Duration  18h:22m:33s
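
Both rows show match 0 in the output above because reset_index(level=0, drop=True) discards the original row label and keeps only the per-row match counter. The following sketch, which assumes x holds the two sample rows from the question, contrasts that with dropping the match level instead, in case the row numbers 0 and 1 are what is wanted:

```python
import pandas as pd

# Assumed reconstruction of `x` from the question's sample data.
x = pd.DataFrame({
    "message": [
        "stuff, Duration: 15h:22m:33s, notstuff,",
        "stuff, Duration: 18h:22m:33s, notstuff,",
    ]
})

extracted = x.message.str.extractall(r",\s*(\w+):\s*([^,]*)")

# Keep the match counter, drop the original row label (as in the answer).
by_match = extracted.reset_index(level=0, drop=True)

# Keep the original row label, drop the match counter.
by_row = extracted.reset_index(level="match", drop=True)

print(by_match)
print(by_row)
```

Dropping the match level is only unambiguous when each row yields at most one match, as it does for this sample data.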
