Replace words in a string using a pandas DataFrame

Time: 2018-01-22 13:30:29

Tags: python pandas

I have a string:

str = 'i have a banana and an apple'

I also have a DataFrame:

name    new_name
have     had
bed      eat
banana   lime

If a word from the string exists in the pandas df, I want to replace it in the string.

For example, for my str the output should be:

'i had a lime and an apple'

I am trying to define a function:

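def replace(df, string):
    L = []
    # iterate over the words of the string, not its characters
    for i in string.split():
        # rows of new_name whose name equals the current word (may be empty)
        match = df.loc[df.name == i, 'new_name']
        # keep the original word when the DataFrame has no replacement
        new_word = match.iloc[0] if not match.empty else i
        L.append(new_word)
    result_str = ' '.join(map(str, L))
    return result_str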

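But this seems very long. Is there a better (more time-efficient) way to get this output?

2 Answers:

Answer 0: (score: 2)

Option 1

  1. Split the string on its natural separator (whitespace).
  2. Call pd.Series.replace, passing the name-to-new_name mapping as the argument.
  3. Join the cells of the Series back together with str.cat / str.join.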

    m = df.set_index('name').new_name
    
    pd.Series(string.split()).replace(m).str.cat(sep=' ')
    'i had a lime and an apple'
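
Broken into its intermediate steps, the three steps listed above might look like the sketch below. The DataFrame is rebuilt from the question's table so the block runs on its own; converting the mapping to a dict and forcing `dtype=object` are assumptions of this sketch, not part of the answer.

```python
import pandas as pd

# The DataFrame from the question, rebuilt so the sketch is self-contained.
df = pd.DataFrame({'name': ['have', 'bed', 'banana'],
                   'new_name': ['had', 'eat', 'lime']})
string = 'i have a banana and an apple'

# Mapping old word -> new word; converted to a dict here (an assumption of this sketch).
m = df.set_index('name')['new_name'].to_dict()

# Step 1: split on whitespace so each word sits in its own cell.
words = pd.Series(string.split(), dtype=object)

# Step 2: replace cells whose whole value equals a key; other cells stay unchanged.
replaced = words.replace(m)

# Step 3: join the cells back into a single string.
result = replaced.str.cat(sep=' ')
print(result)
```

Because the replacement works on whole cells, a word such as "behave" would not be touched by the "have" entry.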
    

string is your original string. Don't use str as a variable name; it shadows the built-in class of the same name.
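
A small sketch of what shadowing means in practice: once a local variable is called `str`, the built-in can no longer be called under that name in the same scope. The function name `shadowed` is hypothetical and exists only to keep the shadowing local.

```python
def shadowed():
    # The local name `str` now hides the built-in class inside this function.
    str = 'i have a banana and an apple'
    try:
        return str(42)  # tries to call a string object, not the built-in
    except TypeError as exc:
        return 'TypeError: {}'.format(exc)


message = shadowed()
print(message)
```

Outside the function the built-in `str` is unaffected, which is why a different variable name such as `string` avoids the problem altogether.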

Alternatively, calling str.join should be faster than str.cat:

    ' '.join(pd.Series(string.split()).replace(m).tolist())
    'i had a lime and an apple'
    

From here on I will use this method of joining strings, and you will see it again in the options that follow.
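
For comparison only, and not part of the original answer, the same word-level lookup can be sketched without a Series by calling `dict.get` on each split word. The `mapping` literal is an assumption standing in for `df.set_index('name').new_name.to_dict()`.

```python
# Stand-in for df.set_index('name').new_name.to_dict() (assumed values from the question).
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
string = 'i have a banana and an apple'

# Look each word up in the mapping; fall back to the word itself when it is absent.
result = ' '.join(mapping.get(word, word) for word in string.split())
print(result)
```

No timing is claimed for this variant; it would need to be measured alongside the other options.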

Option 2
You can skip pandas and use re.sub instead:

    import re
    
    m = df.set_index('name').new_name.to_dict()
    p = r'\b{}\b'.format('|'.join(df.name.tolist()))
    
    re.sub(p, lambda x: m.get(x.group()), string)
    'i had a lime and an apple'
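
One caveat worth checking when building such a pattern: in a pattern of the form `\bA|B|C\b` the `\b` anchors attach only to the first and last alternatives, so a middle alternative can match inside a longer word. The sketch below compares an ungrouped and a grouped pattern; the test sentence and the `mapping` literal are assumptions made for illustration.

```python
import re

# Stand-in for df.set_index('name').new_name.to_dict() (assumed values from the question).
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
text = 'a bedroom and a bed'

# Ungrouped: equivalent to r'\bhave|bed|banana\b', so 'bed' has no boundary check.
ungrouped = r'\b{}\b'.format('|'.join(mapping))
# Grouped: the boundaries apply to every alternative.
grouped = r'\b(?:{})\b'.format('|'.join(mapping))

loose = re.sub(ungrouped, lambda x: mapping[x.group()], text)
strict = re.sub(grouped, lambda x: mapping[x.group()], text)

print(loose)   # 'a eatroom and a eat'
print(strict)  # 'a bedroom and a eat'
```

For the question's sample string both patterns give the same result, because none of the keys occurs inside a longer word there.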
    

Performance

    string = 'i have a banana and an apple ' * 10000
    

    # Series replacement
    
    %%timeit
    m = df.set_index('name').new_name
    ' '.join(pd.Series(string.split()).replace(m).tolist())
    
    100 loops, best of 3: 20.3 ms per loop
    

    # regex replacement
    
    %%timeit
    m = df.set_index('name').new_name.to_dict()
    p = r'\b{}\b'.format('|'.join(df.name.tolist()))
    re.sub(p, lambda x: m.get(x.group()), string)
    
    10 loops, best of 3: 30.7 ms per loop
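
`%%timeit` is an IPython magic. Outside IPython, a comparable measurement could be sketched with the standard-library `timeit` module, as below. The `mapping` literal stands in for the dict built from `df`, and the function name `regex_replace` and the loop count are hypothetical choices; the figures it prints depend on the machine and are not the ones reported above.

```python
import re
import timeit

# Stand-in for df.set_index('name').new_name.to_dict() (assumed values from the question).
mapping = {'have': 'had', 'bed': 'eat', 'banana': 'lime'}
string = 'i have a banana and an apple ' * 10000


def regex_replace():
    # Same steps as the timed regex cell: build the pattern, then substitute.
    p = r'\b{}\b'.format('|'.join(mapping))
    return re.sub(p, lambda x: mapping.get(x.group()), string)


number = 10  # loops per measurement
totals = timeit.repeat(regex_replace, repeat=3, number=number)

# Report the best of the three runs, converted to milliseconds per loop.
best_ms = min(totals) / number * 1000
print('{} loops, best of 3: {:.1f} ms per loop'.format(number, best_ms))
```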
    

Answer 1: (score: 1)

Use replace with the parameter regex=True:

a = 'i have a banana and an apple'

b = pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
print (b)
i had a lime and an apple
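
One thing to keep in mind with `regex=True` is that each key is treated as a pattern, so it can also match inside a longer word. The sketch below contrasts the plain keys with keys wrapped in `\b` word boundaries; the test sentence, the `dtype=object` choice and the conversion to dicts are assumptions of this sketch.

```python
import re

import pandas as pd

# The DataFrame from the question, rebuilt so the sketch is self-contained.
df = pd.DataFrame({'name': ['have', 'bed', 'banana'],
                   'new_name': ['had', 'eat', 'lime']})
text = 'i behave and have a banana'

plain_map = df.set_index('name')['new_name'].to_dict()
# Same mapping, but each key may only match as a whole word.
bounded_map = {r'\b{}\b'.format(re.escape(k)): v for k, v in plain_map.items()}

loose = pd.Series([text], dtype=object).replace(plain_map, regex=True)[0]
strict = pd.Series([text], dtype=object).replace(bounded_map, regex=True)[0]

print(loose)   # 'i behad and had a lime'
print(strict)  # 'i behave and had a lime'
```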

Another solution:

a = 'i have a banana and an apple'

import re
d = df.set_index('name')['new_name'].to_dict()
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
b = p.sub(lambda x: d[x.group()], a)
print (b)
i had a lime and an apple
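
If the names could contain regex metacharacters, joining them unescaped may match text that is not a key, and the `d[x.group()]` lookup would then raise `KeyError`. A minimal sketch using `re.escape` follows; the key `'v1.0'` and the test sentence are hypothetical and do not come from the question's DataFrame.

```python
import re

# Hypothetical mapping with a key that contains a regex metacharacter ('.').
d = {'have': 'had', 'v1.0': 'v2.0'}
text = 'i have v1.0 not v1x0'

# Unescaped: '.' matches any character, so 'v1x0' matches but is not a key.
unescaped = re.compile(r'\b(' + '|'.join(d) + r')\b')
try:
    unescaped.sub(lambda x: d[x.group()], text)
    failed = False
except KeyError:
    failed = True

# Escaped: every key is matched literally.
escaped = re.compile(r'\b(' + '|'.join(re.escape(k) for k in d) + r')\b')
result = escaped.sub(lambda x: d[x.group()], text)

print(failed)  # True
print(result)  # 'i had v2.0 not v1x0'
```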

Timings

a = 'i have a banana and an apple' * 1000

In [205]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
100 loops, best of 3: 2.52 ms per loop

In [206]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
1000 loops, best of 3: 1.43 ms per loop


In [208]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
100 loops, best of 3: 3.11 ms per loop


In [211]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
100 loops, best of 3: 2.91 ms per loop

a = 'i have a banana and an apple' * 10000

In [213]: %%timeit
     ...: import re
     ...: d = df.set_index('name')['new_name'].to_dict()
     ...: p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
     ...: b = p.sub(lambda x: d[x.group()], a)
     ...: 
     ...: 
100 loops, best of 3: 19.8 ms per loop

In [214]: %%timeit
     ...: pd.Series(a).replace(df.set_index('name')['new_name'], regex=True)[0]
     ...: 
100 loops, best of 3: 4.1 ms per loop

In [215]: %%timeit
     ...: m = df.set_index('name').new_name
     ...: 
     ...: pd.Series(a.split()).replace(m).str.cat(sep=' ')
     ...: 
10 loops, best of 3: 26.3 ms per loop

In [216]: %%timeit
     ...: m = df.set_index('name').new_name.to_dict()
     ...: p = r'\b{}\b'.format(df.name.str.cat(sep='|'))
     ...: 
     ...: re.sub(p, lambda x: m.get(x.group()), a)
     ...: 
10 loops, best of 3: 22.8 ms per loop