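What is the fastest way to duplicate sentences by replacing words according to a word list?

Asked: 2017-12-15 14:11:19

Tags: python

I have a word list and some sentences that I need to duplicate: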

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i', 'word_a']
}

sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no keyword here",
    "Last sentence with word_c and word_e"
]

My expected result is:

I have a sentence with word_f
I have a sentence with word_h
I have a sentence with word_i
I have a sentence with word_a
And another sentence with word_e
Don't forget word_a
Don't forget word_c
Last sentence with word_a and word_d
Last sentence with word_a and word_e
Last sentence with word_b and word_d
Last sentence with word_b and word_e
Last sentence with word_c and word_d

Here is my approach:

import re

pattern_list = []
pattern_all = ''
wordlist = sorted(wordlist_dict.values())
for v in wordlist:
    pattern_list.append('({})+'.format('|'.join(v)))
    pattern_all += '|' + '|'.join(v)
pattern_all = '({})+'.format(pattern_all[1:])
print(pattern_list)
# ['(word_a|word_b|word_c)+', '(word_d|word_e)+', '(word_f|word_g|word_h|word_i)+']
print(pattern_all)
# (word_a|word_b|word_c|word_d|word_e|word_f|word_g|word_h|word_i)+

new_sent_list = []
for sent in sent_list:
    match_list = re.findall(pattern_all, sent)
    print(match_list)
    if match_list:
        for match in match_list:
            for i in range(len(pattern_list)):
                if re.search(pattern_list[i], sent):
                    if match in wordlist[i]:
                        match_wordlist = wordlist[i]
                        match_wordlist.remove(match)
                        for word in match_wordlist:
                            new_sent_list.append(sent.replace(match, word))
                    else:
                        continue

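I would like to know whether there is a more efficient way to do this, because my word list and sentence list are much larger than the ones in the example. Thanks in advance.

Update: I just realized that some words belong to more than one class, and that a sentence can contain several keywords, so my code does not work yet.

3 answers:

Answer 0: (score: 1)

First, you can "invert" wordlist_dict into a dictionary that maps each word to its class. Here I assume that each word is in only one class; otherwise it gets more complicated.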

wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}

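Next, you can find all occurrences of any of the words with a single pattern, and use it to (a) get the classes of all the matched words and (b) create a template into which the replacement words are formatted. Note that I wrap the pattern in word boundaries \b, so it does not match fragments of longer words.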

import re
# `sentence` is a single entry of sent_list
pattern = r"\b(" + "|".join(wordclass_dict) + r")\b"
classes = [wordclass_dict[w] for w in re.findall(pattern, sentence)]
template = re.sub(pattern, "{}", sentence)
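To make the two intermediate values concrete, here is a small worked example for one sentence from the question. It is only a sketch: it assumes each word is in a single class (so word_a is left out of class3), and it adds re.escape, which the snippet above does not use, in case a keyword ever contains regex metacharacters.

```python
import re

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i'],  # word_a omitted: one class per word
}
wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}

sentence = "Last sentence with word_c and word_e"
# re.escape is an addition for safety; the keywords here contain no special characters
pattern = r"\b(" + "|".join(map(re.escape, wordclass_dict)) + r")\b"
classes = [wordclass_dict[w] for w in re.findall(pattern, sentence)]
template = re.sub(pattern, "{}", sentence)

print(classes)   # ['class1', 'class2']
print(template)  # Last sentence with {} and {}
```

Each {} in the template lines up with one entry of classes, which is what lets the next step fill the slots independently.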

Now you can iterate over the product of all possible replacements and substitute them into the template:

import itertools

for prod in itertools.product(*(wordlist_dict[c] for c in classes)):
    print(template.format(*prod))

This way, the result for the sentence "And another sentence with word_a and word_d" is:

And another sentence with word_a and word_d
And another sentence with word_a and word_e
And another sentence with word_b and word_d
And another sentence with word_b and word_e
And another sentence with word_c and word_d
And another sentence with word_c and word_e
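The output above includes the original sentence, whereas the expected result in the question leaves it out. The following is a minimal sketch that combines the steps and skips the unchanged combination. It assumes that each word is in exactly one class (so word_a is left out of class3) and that sentences contain no literal { or } characters; the function name variants is made up for illustration.

```python
import itertools
import re

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i'],  # word_a omitted: one class per word
}

# Build the reverse mapping and the compiled pattern once, not once per sentence.
wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}
pattern = re.compile(r"\b(" + "|".join(map(re.escape, wordclass_dict)) + r")\b")


def variants(sentence):
    """Return every rewritten form of `sentence`, excluding the sentence itself."""
    found = pattern.findall(sentence)          # keywords in order of appearance
    template = pattern.sub("{}", sentence)     # one {} slot per keyword
    choices = [wordlist_dict[wordclass_dict[w]] for w in found]
    return [template.format(*prod)
            for prod in itertools.product(*choices)
            if list(prod) != found]            # skip the combination equal to the input


for line in variants("Last sentence with word_c and word_e"):
    print(line)
```

For this sentence the loop prints the five lines listed for it in the question, and a sentence without keywords yields an empty list.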

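This should be considerably faster than your approach (although I have not timed it), because it only searches for pattern twice per sentence, whereas you search for each individual pattern separately. It also works for sentences with more than one placeholder word.

If a word can be in more than one class, you can use: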

import collections

# Map each word to the list of every class that contains it
wordclass_dict = collections.defaultdict(list)
for c in wordlist_dict:
    for w in wordlist_dict[c]:
        wordclass_dict[w].append(c)

# pattern, classes, template as above

for prod in itertools.product(*([w for c in cls for w in wordlist_dict[c]] 
                                for cls in classes)):
    print(template.format(*prod))
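One thing to watch with the multi-class version: when two classes of the same word overlap, the flattened candidate list contains that word twice, so the same sentence is produced twice. A possible sketch that removes the repeats while keeping the order, using dict.fromkeys; the helper name candidates is hypothetical and not part of the answer:

```python
import collections
import itertools
import re

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i', 'word_a'],
}

# word -> list of every class that contains it
wordclass_dict = collections.defaultdict(list)
for c, words in wordlist_dict.items():
    for w in words:
        wordclass_dict[w].append(c)

pattern = re.compile(r"\b(" + "|".join(map(re.escape, wordclass_dict)) + r")\b")


def candidates(word):
    """All words sharing a class with `word`, each listed once, in first-seen order."""
    return list(dict.fromkeys(
        w for c in wordclass_dict[word] for w in wordlist_dict[c]))


sentence = "Don't forget word_a"
found = pattern.findall(sentence)
template = pattern.sub("{}", sentence)
results = [template.format(*prod)
           for prod in itertools.product(*(candidates(w) for w in found))]
print(len(results))  # 7, one sentence per distinct candidate
```

Without the deduplication, word_a would appear once from class1 and once from class3, giving eight sentences with one repeated.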

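Alternatively, you could extend the wordclass_dict entries with all the words themselves instead of the class names. This makes the product simpler, but at the price of potentially much higher space requirements, depending on the size and "overlap" of the word classes.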
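A rough sketch of that alternative, in which the reverse mapping stores the replacement words directly. The name word_alternatives is made up here, and a dict is used as an ordered set so that a word shared by two classes is stored only once:

```python
import collections
import itertools
import re

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i', 'word_a'],
}

# word -> every word that shares at least one class with it (dict keys act as an ordered set)
word_alternatives = collections.defaultdict(dict)
for words in wordlist_dict.values():
    for w in words:
        word_alternatives[w].update(dict.fromkeys(words))

pattern = re.compile(r"\b(" + "|".join(map(re.escape, word_alternatives)) + r")\b")

sentence = "And another sentence with word_d"
found = pattern.findall(sentence)
template = pattern.sub("{}", sentence)
# Iterating a dict yields its keys, so each entry can be passed to product() directly
for prod in itertools.product(*(word_alternatives[w] for w in found)):
    print(template.format(*prod))
```

The lookup per keyword becomes a single dictionary access, but every word now carries a full copy of its class contents, which is the space cost mentioned above.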

Answer 1: (score: 0)

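Here is an alternative version that implements the following idea: keep a reverse dictionary "word -> class" for fast lookup. This assumes that the mapping is invertible. Then call replace() to print the substitutions with all the other words in the word's class.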

replace()
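The code for this answer did not survive in this copy of the thread. The following is only a reconstruction of the idea as described, not the answerer's code: a reverse dictionary plus str.replace, handling one keyword at a time. It assumes one class per word and keywords separated by whitespace; the name word_to_class is illustrative.

```python
wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i'],
}
sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no keyword here",
]

# Reverse dictionary: word -> class, for constant-time lookup
word_to_class = {w: c for c, words in wordlist_dict.items() for w in words}

new_sent_list = []
for sent in sent_list:
    for token in sent.split():
        cls = word_to_class.get(token)
        if cls is None:
            continue  # not a keyword
        for other in wordlist_dict[cls]:
            if other != token:
                new_sent_list.append(sent.replace(token, other))

for sent in new_sent_list:
    print(sent)
```

Because each keyword is replaced on its own, a sentence with two keywords gets variants that change one word at a time, not the full set of combinations.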

Answer 2: (score: 0)

You can try this:

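import re
from functools import reduce  # reduce is not a builtin in Python 3

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i']
}

sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no key word here"
]
# For each sentence, take the class containing its trailing keyword and drop the keyword itself.
# list() is needed in Python 3, where filter() returns a one-shot iterator.
final_data = [list(filter(lambda x: x != ''.join(re.findall(r'(?<=\s)[a-zA-Z]+_[a-zA-Z]+$', i)), [c for a, c in wordlist_dict.items() if any(h.endswith(''.join(re.findall(r'(?<=\s)[a-zA-Z]+_[a-zA-Z]+$', i))) for h in c)][0])) for i in sent_list]
# Drop entries that share a word with an earlier entry; in this example that removes the entry for the sentence without a keyword.
new_final_data = [a for i, a in enumerate(final_data) if not any(c in d for d in final_data[:i] for c in a)]
# a[:-6] strips the trailing keyword, so every keyword must be the last word and exactly 6 characters long.
second_final_data = reduce(lambda x, y: x + y, [[a[:-6] + b for b in c] for a, c in zip(sent_list, new_final_data)])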

Output:

['I have a sentence with word_f', 'I have a sentence with word_h', 'I have a sentence with word_i', 'And another sentence with word_e', "Don't forget word_a", "Don't forget word_c"]