So this is what I need to do.
The naive version:
In list A, replace every occurrence of a substring from list B with the underscored version of that list B entry.
I have a class called Folder() that holds the data.
class Folder():
    dataset = [('question sentence', 'multiple word answer'), ...]  # n times
    list_of_answers = ['answer', 'multiple_word_answer', ...]  # n times

def insert_answers(folder):
    temp_dataset = []
    for q, a in folder.dataset:
        for answer in folder.list_of_answers:
            # If answer is more than one word
            if len(answer.split()) > 1:
                answer_split = answer.split('(')
                # Only use the first part of the split and strip it of whitespace
                answer_split = answer_split[0].strip()
                answer_ = answer.replace(' ', '_')
                q = q.replace(answer_split, answer_)
        temp_dataset.append([q, a])
    folder.dataset = temp_dataset
As you can see, this is very slow, because I have about 435,000 question sentences and several thousand answers in list_of_answers.
I need each q, a pair to stay together.
I will use multiprocessing over roughly 144 cores to make it faster, but I would like to find a faster algorithm.
Sample input:
questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol (something)']
Output:
questions=['pablo_picasso painted guernica and random occurence of andy_warhol_(something) so the question makes sense','andy_warhol_(something) was born on ...']
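The multiprocessing plan mentioned above can be sketched as follows: keep the per-question replacement logic as a standalone function and fan the (q, a) pairs out over a pool of workers. This is only an illustrative sketch, not code from the question; `underscore_question` and the sample `dataset` are hypothetical names, and the replacement logic mirrors the loop in `insert_answers`.

```python
from multiprocessing import Pool

# Module-level so worker processes can see it after fork.
list_of_answers = ['pablo picasso', 'andy warhol (something)']

def underscore_question(pair):
    # Same replacement logic as insert_answers, applied to one (q, a) pair.
    q, a = pair
    for answer in list_of_answers:
        # Only multi-word answers are rewritten.
        if len(answer.split()) > 1:
            # Match on the part before any '(', replace with the full
            # underscored answer.
            stem = answer.split('(')[0].strip()
            q = q.replace(stem, answer.replace(' ', '_'))
    return [q, a]

if __name__ == '__main__':
    dataset = [('pablo picasso painted guernica and random occurence of '
                'andy warhol so the question makes sense', 'pablo picasso')]
    with Pool() as pool:
        # chunksize keeps IPC overhead low when dataset has ~435k entries.
        new_dataset = pool.map(underscore_question, dataset, chunksize=1000)
```

Note that this only parallelizes the existing O(questions × answers) work; it does not change the algorithm itself.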
Answer 0 (score: 1)
Here is a straightforward implementation using regular expressions. It handles your sample test case, but I am not sure how efficient it will be on your large real data. It does not handle overlapping matches (yet), but you have not said how those should be treated either.
Test case:
questions=['pablo picasso painted guernica and random occurence of andy warhol so the question makes sense','andy warhol was born on ...']
list_of_answers=['pablo picasso','andy warhol']
desired = ['pablo_picasso painted guernica and random occurence of andy_warhol so the question makes sense','andy_warhol was born on ...']
Solution:
import re

# One alternation over all answers; \b keeps matches on word boundaries.
finder = r'\b(' + '|'.join(list_of_answers) + r')\b'

def underscorer(match):
    return match.group().replace(' ', '_')

output = [re.sub(finder, underscorer, question) for question in questions]
Test:
>>> output == desired
True