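I have a list of thousands of synonyms. I also have thousands of documents to search for these terms. What is an efficient way to do this in Python (or pseudocode)?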
# this would work for single word synonyms, but there are multiple word synonyms too
synonymSet = set([...])
wordsInDocument = set([...])
synonymsInDocument = synonymSet.intersection(wordsInDocument)
# this would work, but sounds slow
matches = []
for document in documents:
    for synonym in synonymSet:
        if synonym in document:
            matches.append(synonym)
Is there a good solution to this problem, or will it simply take a while? Thanks in advance.
Answer 0 (score: 0)
How about building a regular expression from the list of synonyms:
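import re
# escape each synonym so regex metacharacters in it are matched literally;
# longest first, so a longer synonym wins over a shorter one that is its prefix
pattern = "|".join(re.escape(s) for s in sorted(synonymList, key=len, reverse=True))
regex = re.compile(pattern)
matches = regex.findall(document) # get a list of the matched synonyms
matchedSynonyms = set(matches) # eliminate duplicates using a set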
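One thing to watch with a plain alternation is that it also matches inside longer words ("car" in "carpet"), same as the `synonym in document` loop in the question. Below is a minimal sketch of the same regex idea applied to a whole list of documents, with whole-word matching added. The helper names (`build_synonym_regex`, `find_synonyms`), the sample data, and the choice to ignore case are my own assumptions, not something from the question. With thousands of alternatives it is worth timing this on real data, since Python's `re` tries the alternatives one after another.

```python
import re

def build_synonym_regex(synonyms):
    """Compile one alternation that matches any synonym as a whole word or phrase."""
    # Longest first, so "New York City" is preferred over "New York".
    ordered = sorted(set(synonyms), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # (?<!\w) and (?!\w) require that the match is not glued to a word character
    # on either side; unlike \b, this also behaves when a synonym itself
    # starts or ends with punctuation.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def find_synonyms(documents, synonyms):
    """Return, for each document, the set of synonyms found in it (lowercased)."""
    regex = build_synonym_regex(synonyms)  # compiled once, reused for every document
    return [{m.group(0).lower() for m in regex.finditer(doc)} for doc in documents]

# Hypothetical sample data, for illustration only.
synonym_list = ["car", "automobile", "motor vehicle", "New York", "New York City"]
documents = [
    "The automobile, or motor vehicle, changed New York City.",
    "A carpet is not a car part.",
]
results = find_synonyms(documents, synonym_list)
```

Multi-word synonyms need no special handling here because the regex runs over the raw text rather than over a set of single words. Note that because matches do not overlap, a shorter synonym contained in a longer matched one ("New York" inside "New York City") is not reported separately.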