I could not find a solution on Stack Overflow for replacing phrases in a text column based on a dictionary whose values are lists.
Dictionary
dct = {"LOL": ["laught out loud", "laught-out loud"],
"TLDR": ["too long didn't read", "too long; did not read"],
"application": ["app"]}
Input
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read"),
("what happened?")], columns=['text'])
Expected output
output_df = pd.DataFrame([("haha TLDR and LOL :D"),
("LOL so I couldnt TLDR"),
("what happened?")], columns=['text'])
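Edit
Added an additional entry to the dictionary, i.e. "application": ["app"]
The current solution gives the output "what happlicationened?" for the third row, because "app" is also matched inside "happened".
Please suggest a fix.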
Answer 0 (score: 4)
Build a reverse mapping and use Series.replace with regex=True.
mapping = {v : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)
print(input_df)
text
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
where
print(mapping)
{'laught out loud': 'LOL',
'laught-out loud': 'LOL',
"too long didn't read": 'TLDR',
'too long; did not read': 'TLDR'}
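With the extra "application": ["app"] entry from the question's edit, a bare 'app' key in this mapping also matches the "app" inside "happened". The following is a minimal sketch in plain Python (no pandas) that uses re.sub to illustrate the regular-expression behaviour; the variable names are only for illustration.

```python
import re

text = "what happened?"

# Unanchored pattern: also matches the "app" inside "happened".
unanchored = re.sub(r"app", "application", text)

# Word boundaries restrict the match to a standalone "app".
anchored = re.sub(r"\bapp\b", "application", text)
standalone = re.sub(r"\bapp\b", "application", "open the app")

print(unanchored)  # what happlicationened?
print(anchored)    # what happened?
print(standalone)  # open the application
```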
To match whole words only, add a word boundary on each side of every phrase:
mapping = {rf'\b{v}\b' : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)
print(input_df)
text
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
2 what happened?
where
print(mapping)
{'\\bapp\\b': 'application',
'\\blaught out loud\\b': 'LOL',
'\\blaught-out loud\\b': 'LOL',
"\\btoo long didn't read\\b": 'TLDR',
'\\btoo long; did not read\\b': 'TLDR'}
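Because the phrases are interpolated into regular expressions as-is, a phrase containing a metacharacter such as "." or "+" would be interpreted as a pattern rather than as literal text, and \b does not behave as expected next to a non-word character. A possible variation, sketched here in plain Python, escapes each phrase with re.escape and uses lookarounds in place of \b. The "approximately" entry and the replace_all helper are hypothetical and not part of the question.

```python
import re

dct = {
    "LOL": ["laught out loud", "laught-out loud"],
    "TLDR": ["too long didn't read", "too long; did not read"],
    "application": ["app"],
    "approximately": ["approx."],  # hypothetical entry, not in the question
}

# (?<!\w) and (?!\w) behave like \b for ordinary words, but also work when
# a phrase starts or ends with a non-word character such as ".".
mapping = {
    rf"(?<!\w){re.escape(v)}(?!\w)": k
    for k, V in dct.items()
    for v in V
}

def replace_all(text):
    # Apply every pattern in turn, in dictionary order.
    for pattern, replacement in mapping.items():
        text = re.sub(pattern, replacement, text)
    return text

result = replace_all("what happened in approx. 5 min with the app?")
print(result)  # what happened in approximately 5 min with the application?
```

The same mapping should also be usable with input_df['text'].replace(mapping, regex=True), since the keys are ordinary regular-expression strings.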
Answer 1 (score: 1)
Use df.apply with a custom function.
For example:
import pandas as pd

def custReplace(value):
    dct = {"LOL": ["laught out loud", "laught-out loud"],
           "TLDR": ["too long didn't read", "too long; did not read"]
           }
    # Replace every known long form found in the text with its abbreviation
    for k, v in dct.items():
        for i in v:
            if i in value:
                value = value.replace(i, k)
    return value

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
                         ("laught-out loud so I couldnt too long; did not read")], columns=['text'])
print(input_df["text"].apply(custReplace))
Output:
0 haha TLDR and LOL :D
1 LOL so I couldnt TLDR
Name: text, dtype: object
Or
dct = {"LOL": ["laught out loud", "laught-out loud"],
"TLDR": ["too long didn't read", "too long; did not read"]
}
dct = { "(" + "|".join(v) + ")": k for k, v in dct.items()}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
("laught-out loud so I couldnt too long; did not read")], columns=['text'])
print(input_df["text"].replace(dct, regex=True))
Answer 2 (score: 1)
Here is what I would go with:
import pandas as pd
dct = {"LOL": ["laught out loud", "laught-out loud"],
       "TLDR": ["too long didn't read", "too long; did not read"]
       }
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
                         ("laught-out loud so I couldnt too long; did not read")], columns=['text'])
# Invert the dictionary: long form -> abbreviation
dct_inv = {}
for key, vals in dct.items():
    for val in vals:
        dct_inv[val] = key
dct_inv
def replace_text(input_str):
    for key, val in dct_inv.items():
        input_str = str(input_str).replace(key, val)
    return input_str
# Apply to the text column; with axis=1 on the whole frame each row would arrive as a Series
input_df["text"].apply(replace_text).to_frame()
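Answer 3 (score: 1)
I think the most logical starting point is to invert the dictionary, so that the keys are the original strings and they map to the new string values. You can do this by hand, or in a million other ways, for example: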
import itertools
dict_rev = dict(itertools.chain.from_iterable([list(zip(v, [k]*len(v))) for k, v in dct.items()]))
which isn't super readable. Or this nicer-looking version, which I stole from another answer:
dict_rev = {v : k for k, V in dct.items() for v in V}
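This requires every value in the dictionary to be inside a list (or other iterable), e.g. "new key": ["single_val"], otherwise the string will be exploded into its individual characters.
You can then do the following (based on the code from How to replace multiple substrings of a string?)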
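A small sketch of that pitfall in plain Python: when a value is a bare string instead of a list, the inner loop of the comprehension iterates over its characters. The as_list helper is a hypothetical guard, not something from the answers.

```python
dct_ok = {"application": ["app"]}
dct_bad = {"application": "app"}  # value is a bare string, not a list

rev_ok = {v: k for k, V in dct_ok.items() for v in V}
rev_bad = {v: k for k, V in dct_bad.items() for v in V}

print(rev_ok)   # {'app': 'application'}
print(rev_bad)  # {'a': 'application', 'p': 'application'}

def as_list(value):
    # Hypothetical helper: wrap a lone string so it is treated as one phrase.
    return [value] if isinstance(value, str) else list(value)

rev_fixed = {v: k for k, V in dct_bad.items() for v in as_list(V)}
print(rev_fixed)  # {'app': 'application'}
```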
import re
rep = dict((re.escape(k), v) for k, v in dict_rev.items())
pattern = re.compile("|".join(rep.keys()))
input_df["text"] = input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)
In the timings below, this method runs roughly 3 times faster than the simpler, more elegant solution:
Simple:
%timeit input_df["text"].replace(dict_rev, regex=True)
425 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Faster:
%timeit input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)
160 µs ± 7.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
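The single compiled pattern above still matches "app" inside "happened" once the "application": ["app"] entry is added. One way to combine the single-pass approach with whole-word matching is sketched below in plain Python; it assumes case-sensitive, literal phrases, and replace_phrases is a hypothetical helper name.

```python
import re

dict_rev = {
    "laught out loud": "LOL",
    "laught-out loud": "LOL",
    "too long didn't read": "TLDR",
    "too long; did not read": "TLDR",
    "app": "application",
}

# Try longer phrases first so a short phrase cannot shadow a longer one
# that starts with the same text.
alternatives = sorted(dict_rev, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")

def replace_phrases(text):
    # m.group(0) is the literal matched phrase, so it is a key of dict_rev.
    return pattern.sub(lambda m: dict_rev[m.group(0)], text)

texts = [
    "haha too long didn't read and laught out loud :D",
    "laught-out loud so I couldnt too long; did not read",
    "what happened?",
]
results = [replace_phrases(t) for t in texts]
for line in results:
    print(line)
```

The helper can be passed to input_df["text"].apply(replace_phrases); no timing was measured for this variant.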