Python最快的n针到n草堆替换

时间:2015-05-08 15:21:03

标签: python string algorithm search replace

所以这就是我需要做的事情。

愚蠢的版本:

在列表A中,用列表B的下划线版本替换列表B中每个出现的子字符串。

我有一个名为Folder()的类来保存数据。

class Folder():
    dataset= [('question sentence', 'multiple word answer'),... n times]

    list_of_answers=['answer','multiple_word_answer',... n times]





def insert_answers(folder):

    temp_dataset=[]
    for q,a in folder.dataset:
        for answer in folder.list_of_answers:
            #If answer is more than one word
            if len(answer.split())>1:
                answer_split=answer.split('(')
                #Only use the first part of split and strip it of whitespaces
                answer_split=answer_split[0].strip()
                answer_=answer.replace(' ','_')
                q=q.replace(answer_split,answer_)
        temp_dataset.append([q,a])

    folder.dataset=temp_dataset

正如你所看到的那样,这是非常缓慢的,因为我有大约43.5万个问题句子 和list_of_answers中的几千个答案

我需要q,一对保持在一起。

我将对大约144个处理内核进行多处理以使其更快,但我想找到更快的算法。

示例输入:

questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol (something)']

输出:

questions=['pablo_picasso painted guernica and random occurence of andy_warhol_(something) so the question makes sense','andy_warhol_(something) was born on ...']

1 个答案:

答案 0 :(得分:1)

这是使用正则表达式的直接实现。它解决了您的示例测试用例,但我不确定它对您的大型实际数据有多高效。还没有处理重叠匹配(但是),但你还没有说明如何处理这些匹配。

测试用例:

questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol']
desired = ['pablo_picasso painted guernica and random occurence of andy_warhol so the question makes sense','andy_warhol was born on ...']

解决方案:

import re
finder = r'\b(' + '|'.join(list_of_answers) + r')\b'
def underscorer(match):
    return match.group().replace(' ', '_')
output = [re.sub(finder, underscorer, question) for question in questions]

测试:

>>> output == desired
True