Question

I have a string:

str = 'i have a banana and an apple'

I also have a DataFrame:

name    new_name
have     had
bed      eat
banana   lime

If a word from the string exists in the pandas df, I want to replace it in the string.

For example, for my str the output should be:

'i had a lime and an apple'

I am trying to define a function:

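def replace(df, string):
    L = []
    for i in string.split():  # iterate over words, not characters
        match = df.loc[df.name == i, 'new_name']
        # keep the original word when the df has no replacement for it
        new_word = match.iloc[0] if not match.empty else i
        L.append(new_word)  # append inside the loop, once per word
    result_str = ' '.join(map(str, L))
    return result_str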

But this seems very long-winded. Is there a better (more time-efficient) way to get this output?

Answer 1

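Option 1

Split the string on its natural separator (whitespace)
Call pd.Series.replace, passing new_name (indexed by name) as the argument
Join the cells of the Series with str.cat / str.join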

m = df.set_index('name').new_name

pd.Series(string.split()).replace(m).str.cat(sep=' ')
'i had a lime and an apple'

string is the original string. Don't use str as a variable name; doing so shadows the built-in class of the same name.
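To illustrate the shadowing problem, here is a minimal sketch. The helper name `shadowing_breaks_builtin` is hypothetical and only there to keep the shadowing local to one function, so the rest of the script can still use the built-in.

```python
def shadowing_breaks_builtin():
    # Hypothetical helper: rebinding `str` here only affects this function.
    str = 'i have a banana and an apple'
    try:
        # `str` now refers to a string instance, so calling it fails.
        str(42)
    except TypeError as exc:
        return exc
    return None


error = shadowing_breaks_builtin()
print(type(error).__name__)

# With a different variable name, the built-in stays usable.
string = 'i have a banana and an apple'
print(str(len(string)))
```

Inside the function, the call raises a TypeError because a string object is not callable.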

Alternatively, calling str.join should be faster than str.cat:

' '.join(pd.Series(string.split()).replace(m).tolist())
'i had a lime and an apple'

From here on I'll use this method of joining strings; you'll see it in the upcoming option too.
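For anyone who wants to run Option 1 end to end, here is a self-contained sketch. It assumes the DataFrame from the question is built as shown, and it converts the mapping to a plain dict with `to_dict()`, which is my choice for the sketch rather than part of the answer above.

```python
import pandas as pd

# Assumed reconstruction of the DataFrame from the question.
df = pd.DataFrame({'name': ['have', 'bed', 'banana'],
                   'new_name': ['had', 'eat', 'lime']})
string = 'i have a banana and an apple'

# Map each word in `name` to its replacement in `new_name`.
m = df.set_index('name')['new_name'].to_dict()

# Split into words, replace exact matches, then join again.
replaced = pd.Series(string.split()).replace(m)
via_cat = replaced.str.cat(sep=' ')
via_join = ' '.join(replaced.tolist())

print(via_join)
```

Both joining methods should give the same string; only their speed differs.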

Option 2
You can skip pandas and use re.sub instead:

import re

m = df.set_index('name').new_name.to_dict()
# Group the alternatives so \b applies to every word, and escape regex metacharacters.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, df.name)))

re.sub(p, lambda x: m.get(x.group()), string)
'i had a lime and an apple'
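One detail worth checking when a pattern is built by joining alternatives with `|`: the `\b` anchors only cover every word if the alternatives sit inside a group. The sketch below compares the two forms using only the standard library. The `mapping` dict stands in for the DataFrame, and the sample text and the `substitute` helper are hypothetical.

```python
import re

# Stand-in for df.set_index('name').new_name.to_dict()
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
text = 'i have a bedroom and a banana'

alternatives = '|'.join(map(re.escape, mapping))

# Ungrouped: \b binds only to the first and last alternative,
# so "bed" can match anywhere, including inside "bedroom".
ungrouped = r'\b{}\b'.format(alternatives)

# Grouped: \b applies to every alternative.
grouped = r'\b(?:{})\b'.format(alternatives)


def substitute(pattern, s):
    # Hypothetical helper: replace each match with its mapped word.
    return re.sub(pattern, lambda match: mapping[match.group()], s)


ungrouped_result = substitute(ungrouped, text)
grouped_result = substitute(grouped, text)
print(ungrouped_result)
print(grouped_result)
```

With the ungrouped pattern, "bedroom" is partially rewritten; with the grouped pattern only whole words are replaced.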

Performance

string = 'i have a banana and an apple ' * 10000

# Series replacement

%%timeit
m = df.set_index('name').new_name
' '.join(pd.Series(string.split()).replace(m).tolist())

100 loops, best of 3: 20.3 ms per loop

# regex replacement

%%timeit
m = df.set_index('name').new_name.to_dict()
p = r'\b{}\b'.format('|'.join(df.name.tolist()))
re.sub(p, lambda x: m.get(x.group()), string)

10 loops, best of 3: 30.7 ms per loop
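The `%%timeit` cells above only run inside IPython. As a rough sketch of how a similar comparison could be made in a plain script, the standard `timeit` module can be used. Everything here is an assumption for illustration: the `mapping` dict stands in for the DataFrame, and `split_replace` is a plain-Python baseline that does not appear in the answers.

```python
import re
import timeit

# Stand-in for df.set_index('name').new_name.to_dict()
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
string = 'i have a banana and an apple ' * 10000

pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, mapping))))


def regex_replace():
    # Regex substitution, as in Option 2.
    return pattern.sub(lambda match: mapping[match.group()], string)


def split_replace():
    # Hypothetical baseline: split on whitespace, look up each word, rejoin.
    return ' '.join(mapping.get(word, word) for word in string.split())


for func in (regex_replace, split_replace):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print('{}: {:.2f} ms per loop'.format(func.__name__, best * 1000))
```

Timings depend heavily on the machine and library versions, so the numbers will not match the ones reported above.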

Answer 2

Use replace with the parameter regex=True:

a = 'i have a banana and an apple'

b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple
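Note that with regex=True the keys are treated as patterns, so they can also match inside longer words. If whole-word matching is wanted, one possible sketch is to wrap each key in `\b` anchors before calling replace. The DataFrame construction, the sample sentence and the `anchored` dict are assumptions made for this illustration.

```python
import re

import pandas as pd

# Assumed reconstruction of the DataFrame from the question.
df = pd.DataFrame({'name': ['have', 'bed', 'banana'],
                   'new_name': ['had', 'eat', 'lime']})
a = 'we behave and have a banana'

plain = df.set_index('name')['new_name'].to_dict()
# Wrap every key in word boundaries so only whole words match.
anchored = {r'\b{}\b'.format(re.escape(k)): v for k, v in plain.items()}

substring_result = pd.Series([a]).replace(plain, regex=True).iloc[0]
whole_word_result = pd.Series([a]).replace(anchored, regex=True).iloc[0]

print(substring_result)
print(whole_word_result)
```

The unanchored version also rewrites the "have" inside "behave", while the anchored version leaves that word alone.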

Another solution:

a = 'i have a banana and an apple'

import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple

Timings:

a = 'i have a banana and an apple' * 1000

In [205]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
100 loops, best of 3: 2.52 ms per loop

In [206]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
1000 loops, best of 3: 1.43 ms per loop


In [208]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
100 loops, best of 3: 3.11 ms per loop


In [211]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
100 loops, best of 3: 2.91 ms per loop

a = 'i have a banana and an apple' * 10000

In [213]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
     ...: 
100 loops, best of 3: 19.8 ms per loop

In [214]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
100 loops, best of 3: 4.1 ms per loop

In [215]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
10 loops, best of 3: 26.3 ms per loop

In [216]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
10 loops, best of 3: 22.8 ms per loop

Replace words in a string using a pandas DataFrame
