我有一个带有Company
列的DataFrame。
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
我有一本带有正则表达式的字典,用于规范公司实体。
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
我只需要标准化最后一次出现。这是我想要的输出。
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
(仅规范Limited
的{{1}}和不 Corporation
)
我的代码:
Tundra Corporation Art Limited
是否可以仅更改最后一次出现的实体(我需要更改正则表达式)吗?
答案 0 :(得分:5)
将(\s|$)
更改为($)
以匹配字符串的结尾:
entity_dict = {'(^|\s)corporation($)': ' Corp',
'(^|\s)Limited($)': ' LTD',
'(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
编辑:您可以不使用正则表达式简化字典,然后创建小写字典以供可能使用Series.str.findall
,通过小写字典获得索引str[-1]
和Series.map
的最后一个值,最后替换列表中的理解力:
entity_dict = {'corporation': 'Corp',
'Limited': 'LTD',
'Incorporated': 'INC'}
lower = {k.lower():v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security