I have a dictionary of word lists and some sentences that I need to duplicate:
wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i', 'word_a']
}
sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no keyword here",
    "Last sentence with word_c and word_e"
]
My expected result is:
I have a sentence with word_f
I have a sentence with word_h
I have a sentence with word_i
I have a sentence with word_a
And another sentence with word_e
Don't forget word_a
Don't forget word_c
Last sentence with word_a and word_d
Last sentence with word_a and word_e
Last sentence with word_b and word_d
Last sentence with word_b and word_e
Last sentence with word_c and word_d
Here is my approach:
import re

pattern_list = []
pattern_all = ''
wordlist = sorted(wordlist_dict.values())
for v in wordlist:
    pattern_list.append('({})+'.format('|'.join(v)))
    pattern_all += '|' + '|'.join(v)
pattern_all = '({})+'.format(pattern_all[1:])
print(pattern_list)
# ['(word_a|word_b|word_c)+', '(word_d|word_e)+', '(word_f|word_g|word_h|word_i)+']
print(pattern_all)
# (word_a|word_b|word_c|word_d|word_e|word_f|word_g|word_h|word_i)+
new_sent_list = []
for sent in sent_list:
    match_list = re.findall(pattern_all, sent)
    print(match_list)
    if match_list:
        for match in match_list:
            for i in range(len(pattern_list)):
                if re.search(pattern_list[i], sent):
                    if match in wordlist[i]:
                        match_wordlist = wordlist[i]
                        match_wordlist.remove(match)
                        for word in match_wordlist:
                            new_sent_list.append(sent.replace(match, word))
    else:
        continue
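I would like to know whether there is a more efficient way to do this, because my word lists and my sentence list are much larger than in the example. Thanks in advance.
Update: I just realized that some words belong to multiple classes and that a sentence can contain multiple keywords, so my code does not work yet.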
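Answer 0 (score: 1)
First, you can "invert" wordlist_dict into a dictionary that maps each word to its class. Here I assume that each word is in only one class; otherwise it gets more complicated.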
wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}
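Next, you can find all occurrences of any of the words, using pattern to (a) get all the word classes and (b) create a template for re-formatting the sentence. Note that I wrapped the pattern in word boundaries (\b) so that it does not match word fragments.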
pattern = r"\b(" + "|".join(wordclass_dict) + r")\b"
classes = [wordclass_dict[c] for c in re.findall(pattern, sentence)]
template = re.sub(pattern, "{}", sentence)
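To make the intermediate values concrete, here is a small self-contained sketch of the three lines above applied to one sentence. The sample sentence (including the non-keyword word_cx) and the trimmed-down wordlist_dict are assumptions chosen for illustration.

```python
import re

# Trimmed-down example data (assumed for illustration).
wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
}
wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}

# word_cx is not a keyword; \b keeps word_c from matching inside it.
sentence = "Last sentence with word_c and word_e, not word_cx"

pattern = r"\b(" + "|".join(wordclass_dict) + r")\b"
classes = [wordclass_dict[w] for w in re.findall(pattern, sentence)]
template = re.sub(pattern, "{}", sentence)

print(classes)   # ['class1', 'class2']
print(template)  # Last sentence with {} and {}, not word_cx
```

If the words may contain regex metacharacters, passing each one through re.escape before joining would be a sensible precaution.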
Now you can iterate over the product of all possible replacements and fill them into the template:
for prod in itertools.product(*(wordlist_dict[c] for c in classes)):
    print(template.format(*prod))
With this, the result for the sentence "And another sentence with word_a and word_d" is:
And another sentence with word_a and word_d
And another sentence with word_a and word_e
And another sentence with word_b and word_d
And another sentence with word_b and word_e
And another sentence with word_c and word_d
And another sentence with word_c and word_e
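This should be considerably faster than your approach (although I have not timed it), since it searches for pattern only twice, whereas you search for each individual pattern separately. It also works for sentences with more than one placeholder word.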
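Putting the pieces together, the following is a minimal sketch that applies the steps above to any sentence and skips the one combination that reproduces the original sentence, which the expected output in the question leaves out. The function name expand_sentence is hypothetical, and the sketch assumes each word belongs to exactly one class.

```python
import itertools
import re

# Example data without overlapping classes (assumed for illustration).
wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i'],
}
wordclass_dict = {w: c for c in wordlist_dict for w in wordlist_dict[c]}

# Compile once; re.escape guards against regex metacharacters in words.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, wordclass_dict)) + r")\b")


def expand_sentence(sentence):
    """Return every variant of sentence except the sentence itself."""
    found = pattern.findall(sentence)
    if not found:
        return []
    template = pattern.sub("{}", sentence)
    choices = [wordlist_dict[wordclass_dict[w]] for w in found]
    return [template.format(*prod)
            for prod in itertools.product(*choices)
            if list(prod) != found]


for sent in ["Don't forget word_b", "no keyword here"]:
    for variant in expand_sentence(sent):
        print(variant)
```

Compiling the pattern once outside the function avoids rebuilding it for every sentence, which matters when the sentence list is large. Sentences that contain literal braces would need extra handling before str.format is applied.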
If words can be in multiple classes, you can use this:
wordclass_dict = collections.defaultdict(list)
for c in wordlist_dict:
    for w in wordlist_dict[c]:
        wordclass_dict[w].append(c)

# pattern, classes, template as above

for prod in itertools.product(*([w for c in cls for w in wordlist_dict[c]]
                                for cls in classes)):
    print(template.format(*prod))
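You could also extend the wordclass_dict entries with all the words instead of the class names, which makes the product simpler, but at the price of possibly much higher space requirements, depending on the size and "overlap" of the word classes.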
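A minimal sketch of that variant, assuming the overlapping wordlist_dict from the question: each word maps directly to every word from all classes it belongs to, so product needs no nested comprehension. The name alternatives is hypothetical, and duplicates are dropped while preserving order.

```python
import itertools
import re

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i', 'word_a'],
}

# Map each word to every word that shares at least one class with it.
alternatives = {}
for words in wordlist_dict.values():
    for w in words:
        bucket = alternatives.setdefault(w, [])
        for x in words:
            if x not in bucket:  # skip duplicates from overlapping classes
                bucket.append(x)

pattern = r"\b(" + "|".join(alternatives) + r")\b"
sentence = "Don't forget word_a"
found = re.findall(pattern, sentence)
template = re.sub(pattern, "{}", sentence)

# The original sentence is included, since word_a is in its own list.
for prod in itertools.product(*(alternatives[w] for w in found)):
    print(template.format(*prod))
```

The space cost mentioned above is visible here: word_a stores seven words, whereas the class-name version stores only two class names for it.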
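Answer 1 (score: 0)
Here is an alternative version that implements the following idea: keep an inverted dictionary "word -> class" for quick lookup. This assumes that the mapping is invertible. Then use replace() to print the replacements with all the other words in the word's class.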
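A rough sketch of this idea, offered as an illustration and not necessarily the answerer's code: it builds the inverted dictionary and uses str.replace() to produce each sentence with a keyword swapped for the other words of its class. The names word_to_class and replacements are hypothetical, and the sketch assumes each word belongs to exactly one class and that keywords are separated by whitespace.

```python
wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i'],
}
sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no keyword here",
]

# Inverted dictionary: word -> class (requires one class per word).
word_to_class = {w: c for c, words in wordlist_dict.items() for w in words}


def replacements(sentence):
    """Swap each keyword, one at a time, for the other words of its class."""
    out = []
    for token in sentence.split():
        cls = word_to_class.get(token)  # dictionary lookup per token
        if cls is None:
            continue
        for other in wordlist_dict[cls]:
            if other != token:
                # str.replace swaps every occurrence of token in the sentence.
                out.append(sentence.replace(token, other))
    return out


for sent in sent_list:
    for variant in replacements(sent):
        print(variant)
```

Because the lookup is a plain dictionary access per token, no regular expression is needed, but keywords followed directly by punctuation would be missed by split().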
Answer 2 (score: 0)
You can try this:
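import re
from functools import reduce  # reduce is not a builtin in Python 3

wordlist_dict = {
    'class1': ['word_a', 'word_b', 'word_c'],
    'class2': ['word_d', 'word_e'],
    'class3': ['word_f', 'word_g', 'word_h', 'word_i']
}
sent_list = [
    "I have a sentence with word_g",
    "And another sentence with word_d",
    "Don't forget word_b",
    "no key word here"
]
# For each sentence, take the keyword at its end, find the keyword's class, and keep
# the other words of that class. list() is needed because filter() is lazy in Python 3.
final_data = [list(filter(lambda x: x != ''.join(re.findall(r'(?<=\s)[a-zA-Z]+_[a-zA-Z]+$', i)), [c for a, c in wordlist_dict.items() if any(h.endswith(''.join(re.findall(r'(?<=\s)[a-zA-Z]+_[a-zA-Z]+$', i))) for h in c)][0])) for i in sent_list]
# Drop entries that share a word with an earlier entry; here this removes the entry
# produced by the sentence without a keyword.
new_final_data = [a for i, a in enumerate(final_data) if not any(c in d for d in final_data[:i] for c in a)]
# Cut the 6-character keyword off each sentence and append every alternative word.
second_final_data = reduce(lambda x, y: x + y, [[a[:-6] + b for b in c] for a, c in zip(sent_list, new_final_data)])
print(second_final_data)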
Output:
['I have a sentence with word_f', 'I have a sentence with word_h', 'I have a sentence with word_i', 'And another sentence with word_e', "Don't forget word_a", "Don't forget word_c"]