My pandas DataFrame has a column containing comma-separated characteristics, like this:
Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect
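I want to split this column into multiple dummy-variable columns, but I can't figure out how to get started. I tried splitting the column like this: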
df['incident_characteristics'].str.split(',', expand=True)
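However, this doesn't work, because there are commas inside the descriptions themselves. Instead, I need to split on a regex match of a comma followed by a space and a capital letter. Can str.split take a regex? If so, how is it done?
I think this regex would do what I need: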
,\s[A-Z]
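Answer 0 (score: 3)
Yes, split supports regular expressions. Given your requirement to split on a regex match of a comma followed by a space and a capital letter, you can use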
df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)
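See the regex demo.
Details
\s*,\s* - a comma surrounded by 0+ whitespace characters
(?=[A-Z]) - only if followed by an uppercase ASCII letter
However, it seems you also do not want to match commas inside parentheses. For that, add the (?![^()]*\)) negative lookahead, which fails the match if, immediately to the right of the current position, there are 0+ characters other than ( and ) followed by a ):
r'\s*,\s*(?=[A-Z])(?![^()]*\))'
It prevents matching a comma before a capitalized word inside parentheses (provided the parentheses are not nested).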
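To make the difference between the two patterns concrete, here is a minimal sketch using the standard-library re module rather than pandas. The shortened sample string is an assumption made for illustration: "One location" is capitalized on purpose so that the plain pattern splits inside the parentheses while the extended one does not.

```python
import re

# Hypothetical shortened sample; "One location" is capitalized on purpose
# so that the two patterns behave differently.
text = ("Shot - Dead (murder, accidental, suicide), Suicide - Attempt, "
        "Mass Murder (4+ victims, One location), Murder/Suicide")

basic = r"\s*,\s*(?=[A-Z])"
paren_aware = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# The plain pattern also splits before "One", inside the parentheses.
basic_parts = re.split(basic, text)

# The extra negative lookahead rejects commas that are followed by a ")"
# with no "(" or ")" in between, i.e. commas inside a parenthesized group.
paren_parts = re.split(paren_aware, text)

print(basic_parts)
print(paren_parts)
```

The same pattern strings can be passed to .str.split on a pandas column, as in the answer above.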
Answer 1 (score: 1)
You can try .str.extractall (though I suspect there are better patterns than mine):
import pandas as pd
txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]
Output:
# match
# 0 0 Shot - Wounded/Injured
# 1 Shot - Dead (murder, accidental, suicide)
# 2 Suicide - Attempt
# 3 Murder/Suicide
# 4 Attempted Murder/Suicide (one variable unsucc...
# 5 Institution/Group/Business
# 6 Mass Murder (4+ deceased victims excluding th...
# 7 Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object
If you use .str.split with your pattern, the first letter of each piece after the first will be removed, because it is consumed as part of the delimiter:
df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)
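The effect is easy to see on a short string. The sketch below uses the standard-library re module and a made-up three-item sample (an assumption for illustration), and contrasts the consuming pattern with a lookahead variant that leaves the capital letter in place.

```python
import re

# Hypothetical short sample for illustration.
sample = "Shot - Dead, Suicide - Attempt, Murder/Suicide"

# [A-Z] is part of the match, so the capital letter is swallowed by the delimiter.
consumed = re.split(r",\s[A-Z]", sample)

# Wrapping it in a lookahead only checks for the letter without consuming it.
kept = re.split(r",\s(?=[A-Z])", sample)

print(consumed)  # ['Shot - Dead', 'uicide - Attempt', 'urder/Suicide']
print(kept)      # ['Shot - Dead', 'Suicide - Attempt', 'Murder/Suicide']
```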
Answer 2 (score: 1)
I would first build the data and then feed it into the DataFrame, like so:
import pandas as pd, re
junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""
rx = re.compile(r'\([^()]+\)|,(\s+)')
data = [x
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]
df = pd.DataFrame({'incident_characteristics': data})
print(df)
This yields
incident_characteristics
0 Shot - Wounded/Injured
1 Shot - Dead
2 Suicide - Attempt
3 Murder/Suicide
4 Attempted Murder/Suicide
5 Institution/Group/Business
6 Mass Murder
7 Mass Shooting (4+ victims injured or killed ex...
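Note that this assumes commas inside parentheses should be ignored when splitting. As the output shows, closed parenthesized groups are dropped entirely, because the pattern treats them as part of the delimiter; the last entry keeps its text only because its parenthesis is never closed in the sample.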
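None of the answers goes on to build the dummy-variable columns the question asks for. One possible way to get there, sketched under assumptions: the two-row DataFrame below is made up for illustration, the split pattern is the parenthesis-aware one from Answer 0, and the explode / pd.get_dummies / groupby combination is just one approach, not something proposed in the thread.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
df = pd.DataFrame({
    "incident_characteristics": [
        "Shot - Dead (murder, accidental, suicide), Suicide - Attempt",
        "Suicide - Attempt, Murder/Suicide",
    ]
})

pattern = r"\s*,\s*(?=[A-Z])(?![^()]*\))"

# Without expand=True, each cell becomes a list of characteristics.
parts = df["incident_characteristics"].str.split(pattern)

# One row per (original row, characteristic), then one indicator column per
# distinct characteristic, collapsed back to one row per original row.
dummies = (
    pd.get_dummies(parts.explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

print(dummies)
```

The result can be joined back onto the original frame with df.join(dummies) if the other columns are needed alongside the indicators.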