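I am trying to find and replace words in about 20K comments. The find/replace pairs are stored in one data frame, which holds a little over 20,000 of them. The comments are in a different data frame, which holds about 20,000 rows.
Below is a sample: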
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
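I am expecting the output below:
op = ["Hull Damage happened and its insured by maritime hull insurance company","Blocked and claims are blocked"]
Please help.
I am currently using a loop, but it takes more than 20 minutes to finish. The data has 20K records, and there are 30,000 words to replace.
KeywordSynonym: the data frame that holds the find and replace data loaded from SQL
backup: the data frame that holds the data to be cleaned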
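import re

# Flatten the whole data frame into one string so each pattern is applied once
backup = str(backup)
TrainingClaimNotes_KwdSyn = []
for index, row in KeywordSynonym.iterrows():
    word = KeywordSynonym.Synonym[index].lower()   # text to find
    value = KeywordSynonym.Keyword[index].lower()  # text to put in its place
    # Match the synonym only as a whole word or phrase
    my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
    if re.search(my_regex, backup):
        backup = re.sub(my_regex, value, backup)
TrainingClaimNotes_KwdSyn.append(backup)
# Split the flattened string back into individual notes
TrainingClaimNotes_KwdSyn_Cmp = backup.split('\'", "\'')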
Answer 0: (score: 1)
Use:
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Output
>>> print(df1['Data_1'].tolist())
['hull damage happened and its insured by maritime hull insurance company', 'blocked and claims are blocked']
Explanation
dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
creates a mapping between the strings to find and the strings to replace them with:
{'insurence': 'insurance', 'non cash entry': 'blocked'}
Convert each find string into a regex so that it can be used for matching, with \b anchoring it at word boundaries:
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
{'(\\b)insurence(\\b)': '\\1insurance\\2', '(\\b)non cash entry(\\b)': '\\1blocked\\2'}
The last piece is just doing the actual replace:
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
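One caveat on the regex-building step: the find strings are inserted into the pattern as-is, so any regex metacharacter in them (such as a dot) would be treated as a pattern rather than as literal text. Below is a minimal sketch of the same dictionary built with re.escape. The 'p.o box' entry and the sample Series are made up for illustration and are not part of the question's data. Because \b is zero-width, the capture groups are dropped here, which also avoids a replacement that starts with a digit being read as part of a group number.

```python
import re
import pandas as pd

# Hypothetical find/replace pairs; 'p.o box' contains a regex metacharacter
find_repl = {'insurence': 'insurance', 'p.o box': 'post office box'}

# Escape each find string so it is matched literally between word boundaries
d2 = {r'\b{}\b'.format(re.escape(k)): v for k, v in find_repl.items()}

s = pd.Series(["hull insurence company", "send to p.o box 12", "pro box 12"])
result = s.replace(d2, regex=True).tolist()
print(result)
```

Without the escaping, the pattern for 'p.o box' would also match 'pro box', since an unescaped dot matches any character.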
Note: I applied .lower() everywhere to get proper matches. Obviously you can reshape the result into whatever form you need.
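With tens of thousands of pairs, applying one regex per pair still means scanning every comment once for each pair. A possible alternative, sketched here as an assumption rather than as part of the answer above, is to combine all find strings into a single alternation and look up each match in the mapping, so that every comment is scanned only once. The names mapping, pattern and replace_all are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Data': ["Hull Damage happened and its insured by maritime hull insurence company",
                             "Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({'Find': ["Insurence", "Non cash entry"],
                    'Replace': ["Insurance", "Blocked"]})

# Lower-cased lookup table: find string -> replacement
mapping = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))

# Longest find strings first, so a longer phrase wins over a shorter one
# that starts at the same position
keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_all(text):
    # One pass over the text; each match is swapped via the lookup table
    return pattern.sub(lambda m: mapping[m.group(0)], text.lower())

df1['Data_1'] = df1['Data'].map(replace_all)
result = df1['Data_1'].tolist()
print(result)
```

Note that this behaves slightly differently from replacing pair by pair: text produced by one replacement is not scanned again, so replacements do not chain. Whether it is actually faster on the real data would need to be measured.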