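In a pandas DataFrame column, I need to remove thousands of common words such as LLC, INC, and CO from the end of thousands of company names. The following removes the common words wherever they appear: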
toexclude = dfwcomwords['ending'].tolist()
data['names'] = data['names'].apply(lambda x: ' '.join([word for word in x.split() if word not in (toexclude)]))
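But I only want to remove words at the end of the name, i.e. "INC INTERNATIONAL LLC" should become "INC INTERNATIONAL". (The code above turns it into "INTERNATIONAL".) Any help would be greatly appreciated.
Edit: Following @ba_ul's suggestion below, I get an unbalanced parenthesis error: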
for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
Traceback (most recent call last):
File "<ipython-input-139-c68049bc0f0d>", line 2, in <module>
data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
File "/anaconda3/envs/pandas/lib/python3.7/site-packages/pandas/core/series.py", line 4042, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
File "<ipython-input-139-c68049bc0f0d>", line 2, in <lambda>
data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 192, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/anaconda3/envs/pandas/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/anaconda3/envs/pandas/lib/python3.7/sre_parse.py", line 944, in parse
raise source.error("unbalanced parenthesis")
error: unbalanced parenthesis
Answer 0 (score: 1)
You can check two conditions for each word: (1) whether it is in toexclude, and (2) whether it is the last word in the company name.
toexclude = dfwcomwords['ending'].tolist()

def remove_suffix(x):
    x_list = x.split()
    # Drop a word only if it is in toexclude AND it is the last word of the name
    return ' '.join([word for index, word in enumerate(x_list) if not (word in toexclude and index == len(x_list) - 1)])

data['names'] = data['names'].apply(remove_suffix)
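To see what the last-word check does on a few names, here is a small standalone sketch. The suffix set and sample names are made up for illustration (in the question the suffixes come from dfwcomwords['ending']), and the helper name remove_last_word_if_suffix is hypothetical. It uses a set rather than a list, on the assumption that membership tests against thousands of suffixes would otherwise be slow.

```python
# Hypothetical suffixes; the question builds these from dfwcomwords['ending'].tolist()
toexclude = {"LLC", "INC", "CO"}  # a set gives constant-time membership tests

def remove_last_word_if_suffix(name):
    """Drop the final word of name when it is one of the suffixes in toexclude."""
    words = name.split()
    if words and words[-1] in toexclude:
        words = words[:-1]
    return " ".join(words)

# Made-up sample names
samples = ["INC INTERNATIONAL LLC", "ACME CO", "CO OP FOODS", "LLC"]
cleaned = [remove_last_word_if_suffix(s) for s in samples]
```

Note that a name consisting only of a suffix becomes an empty string, and that this comparison is case-sensitive and cannot match a suffix that itself contains spaces.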
Edit: For suffixes that contain spaces, you can first remove them with a regular expression and pandas' str.replace function.
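# Escape the dots so they match literally; regex=True is required on pandas 2.0+, where it no longer defaults to True
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True)
# If you have multiple such unusual suffixes, you can chain all of them together
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True).str.replace(r'L L C$', '', regex=True)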
The $ in the regular expression ensures that you only remove what appears at the end of the name.
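One caveat with a suffix like S. A. R. L.: in a regular expression an unescaped dot matches any character, so the dots need a backslash if they should match literally. The sketch below uses the standard re module with made-up names to show the difference, and to show that $ leaves a suffix in the middle of a name alone.

```python
import re

unescaped = re.compile(r"S. A. R. L.$")      # each "." matches ANY character
escaped = re.compile(r"S\. A\. R\. L\.$")    # each "\." matches a literal dot

# Both patterns remove the real suffix at the end of the name
real_suffix = escaped.sub("", "ACME S. A. R. L.")

# Only the unescaped pattern also removes this unrelated (made-up) ending
false_hit = unescaped.sub("", "ACME SA AB RC LD")
no_hit = escaped.sub("", "ACME SA AB RC LD")

# "$" anchors the match at the end, so a suffix in the middle is kept
middle = escaped.sub("", "S. A. R. L. HOLDINGS")
```

The removal leaves a trailing space behind, so a final .strip() (or .str.strip() on the column) may be worth adding.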
Edit #2: Based on the new comments, a pure regex solution is probably best. It is only three lines and should cover all cases.
import re
for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), '', x, flags=re.IGNORECASE))
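The "unbalanced parenthesis" error in the question most likely comes from a suffix that contains a regex metacharacter such as an unmatched ")"; this is an assumption, since the actual suffix list is not shown. re.escape avoids that. As a possible alternative to looping over every suffix, the sketch below combines all escaped suffixes into a single compiled pattern so each name is scanned once. The sample suffixes and the names suffix_pattern and strip_suffix are hypothetical.

```python
import re

# Hypothetical suffixes; the question builds these from dfwcomwords['ending'].tolist().
# "LTD)" stands in for a messy entry that would break an unescaped pattern.
toexclude = ["LLC", "INC", "CO", "S. A. R. L.", "LTD)"]

# Escape each suffix so characters like ")" or "." match literally,
# and list longer suffixes first so they are tried before shorter ones.
alternatives = "|".join(re.escape(w) for w in sorted(toexclude, key=len, reverse=True))

# (?<!\S) requires the suffix to start at the beginning of the string or after whitespace;
# \s*$ anchors it at the end of the name, allowing trailing spaces.
suffix_pattern = re.compile(rf"(?<!\S)(?:{alternatives})\s*$", flags=re.IGNORECASE)

def strip_suffix(name):
    """Remove one trailing suffix from name, then trim surrounding whitespace."""
    return suffix_pattern.sub("", name).strip()
```

With the question's DataFrame this would be applied as data['names'] = data['names'].apply(strip_suffix). Only one trailing suffix is removed per call, so "ACME CO LLC" would become "ACME CO".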
Answer 1 (score: 1)
Change the check as follows:
data['names'] = data['names'].apply(
    lambda x: ' '.join([word for i, word in enumerate(x.split()) if not (
        i == len(x.split()) - 1 and word in toexclude)]))