Question

My pandas DataFrame has a column containing comma-separated features, like this:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect

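I want to split this column into multiple dummy-variable columns, but I can't figure out how to start. I tried splitting the column like this: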

df['incident_characteristics'].str.split(',', expand=True)

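However, this doesn't work because some descriptions contain commas. Instead, I need to split on a regex match of a comma followed by a space and a capital letter. Can str.split take a regex? If so, how?

I think this regex would do what I need: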

,\s[A-Z]

Answer 1

Yes, split supports regex. Given your requirement to split on

a regex match of a comma followed by a space and a capital letter

you can use

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)

See the regex demo.

Details

\s*,\s* - a comma surrounded by 0+ whitespace characters
(?=[A-Z]) - only if followed by an uppercase ASCII letter
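
As a quick check of this pattern, here is a minimal sketch using Python's re module on a shortened, made-up sample string. It assumes the same regex syntax applies in pandas, which should hold because str.split treats a multi-character pattern as a regular expression:

```python
import re

# Shortened sample, invented for illustration (not the full column value).
sample = "Shot - Wounded/Injured, Suicide - Attempt, Murder/Suicide"

# Comma with optional surrounding whitespace, only when an uppercase
# ASCII letter follows. The lookahead does not consume that letter.
pattern = r"\s*,\s*(?=[A-Z])"

parts = re.split(pattern, sample)
print(parts)
```

Because the capital letter sits inside a lookahead, it stays at the start of the next item instead of being swallowed by the delimiter.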

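However, it seems you also do not want to match commas inside parentheses. Add the (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current location, there are 0+ characters other than ( and ) followed by a ):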

r'\s*,\s*(?=[A-Z])(?![^()]*\))'

It prevents matching a comma before a capitalized word inside parentheses (provided there are no nested parentheses).
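
The difference between the two patterns can be sketched with Python's re module. The sample string below is hypothetical, chosen so that a capitalized word follows a comma inside the parentheses:

```python
import re

# Hypothetical sample: ", One" sits inside parentheses and starts with a capital.
sample = "Mass Murder (4+ victims, One location), Murder/Suicide"

without_guard = r"\s*,\s*(?=[A-Z])"
with_guard = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# Without the negative lookahead the comma inside the parentheses also splits.
split_plain = re.split(without_guard, sample)

# With it, a comma is skipped when a ")" comes up before any other parenthesis.
split_guarded = re.split(with_guard, sample)

print(split_plain)
print(split_guarded)
```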

See another regex demo.
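
Since the question's goal is dummy-variable columns, one possible way to get there from the split is sketched below. The two-row DataFrame is made up for illustration, and the explode/get_dummies chain is just one approach under that assumption, not necessarily the best fit for the real data:

```python
import pandas as pd

# Made-up two-row example; the real column has many more categories.
df = pd.DataFrame({
    "incident_characteristics": [
        "Shot - Dead (murder, accidental, suicide), Murder/Suicide",
        "Murder/Suicide, Suicide - Attempt",
    ]
})

pattern = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# Split each cell into a list (a multi-character pattern is treated as a
# regex), then give every item its own row while keeping the original index.
parts = df["incident_characteristics"].str.split(pattern).explode()

# One indicator column per distinct item, collapsed back to one row per record.
dummies = pd.get_dummies(parts).groupby(level=0).max().astype(int)
print(dummies)
```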

Answer 2

You can try .str.extractall (but I think there are better patterns than mine):

import pandas as pd

txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]

Output:

#    match
# 0  0        Shot - Wounded/Injured
#    1        Shot - Dead (murder, accidental, suicide)
#    2        Suicide - Attempt
#    3        Murder/Suicide
#    4        Attempted Murder/Suicide (one variable unsucc...
#    5        Institution/Group/Business
#    6        Mass Murder (4+ deceased victims excluding th...
#    7        Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object

If you use .str.split with the pattern from the question, the first letter of each item will be removed, because it is consumed as part of the delimiter:

df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)
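
The effect is easy to reproduce with Python's re module on a shortened, made-up sample. The second pattern is one possible fix, moving the capital letter into a lookahead so it is not consumed:

```python
import re

# Shortened sample, invented for illustration.
sample = "Shot - Wounded/Injured, Suicide - Attempt, Murder/Suicide"

# The capital letter is part of the delimiter, so it disappears from the items.
consumed = re.split(r",\s[A-Z]", sample)

# With a lookahead the letter is only checked, not consumed.
kept = re.split(r",\s(?=[A-Z])", sample)

print(consumed)
print(kept)
```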

Answer 3

I first build the data and then feed it into a DataFrame, like so:

import pandas as pd, re

junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""

rx = re.compile(r'\([^()]+\)|,(\s+)')

data = [x 
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]

df = pd.DataFrame({'incident_characteristics': data})
print(df)

This yields

                            incident_characteristics
0                             Shot - Wounded/Injured
1                                        Shot - Dead
2                                  Suicide - Attempt
3                                     Murder/Suicide
4                           Attempted Murder/Suicide
5                         Institution/Group/Business
6                                        Mass Murder
7  Mass Shooting (4+ victims injured or killed ex...

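Note that this assumes commas inside parentheses should be ignored when splitting. As the output shows, the parenthesized text itself is dropped too, because it is treated as a delimiter; the last item keeps its parenthesis only because the sample string is missing the closing ).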
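
The filtering in the list comprehension is needed because re.split also returns the contents of capturing groups, and None for a group that did not take part in a match. A minimal sketch on a tiny made-up string shows the raw pieces before and after cleaning:

```python
import re

rx = re.compile(r"\([^()]+\)|,(\s+)")

# Tiny made-up input: one parenthesized part followed by one real separator.
raw = rx.split("A (x, y), B")

# Drop None and empty strings first, then strip and drop whitespace-only pieces.
cleaned = [x
           for nugget in raw if nugget
           for x in [nugget.strip()] if x]

print(raw)
print(cleaned)
```

Here raw contains None (the parenthesized branch has no group), an empty string between the two adjacent delimiters, and the captured whitespace, which is why both the `if nugget` and `if x` checks are present.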

Pandas split on regex
