I have a string:
str = 'i have a banana and an apple'
I also have a DataFrame:
name new_name
have had
bed eat
banana lime
I want to replace each word in the string that also appears in the name column of the pandas df.
For example, for my str the output should be:
'i had a lime and an apple'
I am trying to define a function:
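def replace(df, string):
    L = []
    # iterate over words, not over single characters
    for i in string.split():
        # rows of df whose name equals the current word
        match = df.loc[df.name == i, 'new_name']
        # keep the original word when there is no replacement
        new_word = match.iloc[0] if not match.empty else i
        L.append(new_word)
    result_str = ' '.join(map(str, L))
    return result_str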
But this seems very long-winded. Is there a better (more time-efficient) way to get this output?
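Answer 0 (score: 2)
Option 1
Split the string into words, call pd.Series.replace with new_name (indexed by name) as the mapping, then rejoin the words with str.cat / str.join: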
m = df.set_index('name').new_name
pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'
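For reference, a self-contained sketch of Option 1 that can be pasted into a fresh session. The DataFrame is rebuilt from the table in the question, and the mapping is converted to a plain dict with to_dict(); that conversion is an assumption made here for portability across pandas versions, not something the answer requires.

```python
import pandas as pd

# Rebuild the lookup table from the question.
df = pd.DataFrame({
    'name': ['have', 'bed', 'banana'],
    'new_name': ['had', 'eat', 'lime'],
})
string = 'i have a banana and an apple'

# Mapping old word -> new word (dict form is a choice made for this sketch).
m = df.set_index('name')['new_name'].to_dict()

# One word per row; without regex=True, replace compares whole cell values,
# so only exact word matches are swapped.
words = pd.Series(string.split())
result = ' '.join(words.replace(m).tolist())
print(result)  # i had a lime and an apple
```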
Here string is the original string. Don't use str as a variable name; doing so shadows the built-in class of the same name.
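A small illustration of why shadowing str is risky. The function name shadowed is hypothetical, and the assignment is kept inside a function so the built-in stays usable afterwards.

```python
def shadowed():
    # The local name hides the built-in class for the rest of this function.
    str = 'i have a banana and an apple'
    try:
        return str(42)  # a plain string object is not callable
    except TypeError as exc:
        return type(exc).__name__

print(shadowed())  # TypeError
print(str(42))     # the built-in is untouched outside the function
```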
Alternatively, calling str.join should be faster than str.cat:
' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'
From here on I will use this method of joining strings; you will see it in the options that follow as well.
Option 2
You can skip pandas and use re.sub instead:
import re
m = df.set_index('name').new_name.to_dict()
# group the alternation so that \b applies to every word, not just the first and last
p = r'\b(?:{})\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'
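One detail worth checking in a pattern like this is how \b interacts with |. The sketch below uses only the standard library; the dict m stands in for df.set_index('name').new_name.to_dict(), and the helper swap and the test sentence are hypothetical. It compares an alternation wrapped in a non-capturing group against an unwrapped one, and also escapes the keys with re.escape in case a name contains regex metacharacters.

```python
import re

# Hypothetical stand-in for df.set_index('name').new_name.to_dict()
m = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}

# Escape each key and wrap the alternation in a non-capturing group,
# so both \b anchors apply to every alternative.
grouped = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, m))))
# Without the group, \b binds only to the first and the last alternative.
ungrouped = re.compile(r'\b{}\b'.format('|'.join(m)))

def swap(pattern, text):
    # Fall back to the matched text if the key is missing from the mapping.
    return pattern.sub(lambda x: m.get(x.group(), x.group()), text)

text = 'i have a bedroom and a banana'
print(swap(grouped, text))    # i had a bedroom and a lime
print(swap(ungrouped, text))  # i had a eatroom and a lime
```

The fallback in swap matters because a replacement function that returns None makes re.sub raise a TypeError.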
Performance
string = 'i have a banana and an apple ' * 10000
# Series replacement
%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())
100 loops, best of 3: 20.3 ms per loop
# `re`gex replacement
%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)
10 loops, best of 3: 30.7 ms per loop
Answer 1 (score: 1)
Use replace with the parameter regex=True:
a = 'i have a banana and an apple'
b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple
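With regex=True the keys are treated as regular expressions, so they can also match inside longer words. A minimal sketch of that behaviour, assuming a reasonably recent pandas: the mapping is written as a plain dict (a stand-in for the indexed Series above), and the test sentence and the names loose, anchored, and strict are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df.set_index('name')['new_name'], in dict form.
m = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
s = pd.Series(['i have a bedroom and a banana'])

# Keys are regex patterns, so 'bed' also matches inside 'bedroom'.
loose = s.replace(m, regex=True)[0]

# Anchoring each key with \b restricts the match to whole words.
anchored = {r'\b{}\b'.format(k): v for k, v in m.items()}
strict = s.replace(anchored, regex=True)[0]

print(loose)
print(strict)
```

If whole-word replacement is what you want, the anchored keys are the safer choice; for the sentence in the question both variants give the same result.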
Another solution:
a = 'i have a banana and an apple'
import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple
Timings:
a = 'i have a banana and an apple' * 1000
In [205]: %%timeit
...: import re
...: d = df.set_index('name')['new_name'].to_dict()
...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
...: b = p.sub(lambda x: d[x.group()], a)
...:
100 loops, best of 3: 2.52 ms per loop
In [206]: %%timeit
...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
...:
1000 loops, best of 3: 1.43 ms per loop
In [208]: %%timeit
...: m = df.set_index('name').new_name
...:
...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
...:
100 loops, best of 3: 3.11 ms per loop
In [211]: %%timeit
...: m = df.set_index('name').new_name.to_dict()
...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
...:
...: re.sub(p, lambda x: m.get(x.group()), a)
...:
100 loops, best of 3: 2.91 ms per loop
a = 'i have a banana and an apple' * 10000
In [213]: %%timeit
...: import re
...: d = df.set_index('name')['new_name'].to_dict()
...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
...: b = p.sub(lambda x: d[x.group()], a)
...:
...:
100 loops, best of 3: 19.8 ms per loop
In [214]: %%timeit
...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
...:
100 loops, best of 3: 4.1 ms per loop
In [215]: %%timeit
...: m = df.set_index('name').new_name
...:
...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
...:
10 loops, best of 3: 26.3 ms per loop
In [216]: %%timeit
...: m = df.set_index('name').new_name.to_dict()
...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
...:
...: re.sub(p, lambda x: m.get(x.group()), a)
...:
10 loops, best of 3: 22.8 ms per loop
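The timings above rely on IPython's %%timeit magic. Outside IPython, the standard-library timeit module can produce a comparable "best of 3" figure. The sketch below times only the compiled-regex approach; the dict d is a stand-in for the DataFrame mapping, the function name regex_replace is hypothetical, and the pattern is compiled once outside the timed call, which differs slightly from the cells above. Absolute numbers will vary by machine and version.

```python
import re
import timeit

# Hypothetical stand-in for df.set_index('name')['new_name'].to_dict()
d = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
a = 'i have a banana and an apple ' * 1000

# Compiled once, outside the timed function (unlike the %%timeit cells).
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')

def regex_replace():
    return p.sub(lambda x: d[x.group()], a)

# Three rounds of ten calls each; report the fastest round per call,
# which mirrors the "best of 3" figure printed by %%timeit.
rounds = timeit.repeat(regex_replace, repeat=3, number=10)
best_ms = min(rounds) / 10 * 1000
print('best of 3: {:.2f} ms per loop'.format(best_ms))
```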