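I have a list of keywords and another, longer string (2 or 3 pages). I want to find which keywords from the list appear in the string. For example:

    keywords = ["k1", "k2", "k3 k4", "k5", "k6 k7 k8"]
    paragraphs = "This will be a 2 to 4 page article"
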
A simple approach would be:

    present_keywords = [x for x in keywords if x in paragraphs]

The time complexity of the above algorithm is O(m*n) ≈ O(n^2).
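As a rough illustration of the simple approach (the sample keywords and text below are made up for this sketch, not taken from the question): Python's `in` on strings is a substring test, so multi-word keywords are found, but a short keyword can also match inside a longer word.

```python
# Hypothetical sample data, not from the original question
keywords = ["new york", "art", "london"]
paragraphs = "this article is about new york"

# `x in paragraphs` is a substring test: one scan of the text per keyword
present_keywords = [x for x in keywords if x in paragraphs]

print(present_keywords)  # ['new york', 'art']  ('art' matches inside 'article')
```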
Another way:
I could build a heap from the keyword list (time complexity: O(n log n)), then look up each word of the paragraph in the heap (time complexity: O(n)).
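Note: the keywords can also be bigrams and trigrams, so the second approach will not work.
What is an efficient way to achieve this?
Many people have posted solutions without considering this constraint. For example, "New York" is a keyword. Splitting the paragraph would separate "New" and "York" into different words. This is also mentioned in the note above.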
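To make the constraint concrete, here is a minimal sketch (the sample data is assumed for illustration) of why looking up single split words misses a multi-word keyword:

```python
# Hypothetical sample data for illustration
keywords = {"new york", "city"}
paragraph = "new york is a big city"

# Splitting on whitespace yields single words only
words = paragraph.split()

# Word-by-word lookup finds 'city' but can never find 'new york',
# because 'new' and 'york' are checked separately
found = [w for w in words if w in keywords]

print(found)  # ['city']
```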
Answer 0 (score: 4)
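To reduce the time complexity, we can increase the space complexity. Go through keywords and hash each one into a set(), assuming every keyword is unique (if not, the duplicates are simply dropped).
Then you can go through paragraph and build one-, two-, and three-word phrases, checking whether each phrase is present in hashedKeywords and incrementing its count whenever it appears. The time complexity will be O(m + n) ≈ O(n), but the space complexity goes from O(1) to O(n).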

    import string  # for removing punctuation

    # Sample input with bigrams and trigrams in keywords
    paragraphs = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
    keywords = ['magna', 'lorem ipsum', 'sed do eiusmod', 'aliqua']

    # Hash keywords into a set for faster lookup
    hashedKeywords = set()
    for keyword in keywords:
        hashedKeywords.add(keyword)

    # Strip punctuation from paragraph words using translate() and make matching case-insensitive using lower()
    table = str.maketrans({key: None for key in string.punctuation})
    wordsInParagraphs = [w.translate(table).lower() for w in paragraphs.split()]

    # Initialize for loop
    maxGram = 3
    wordFrequency = {}

    # Loop through words in paragraphs, building a small list of one-, two-, and three-word phrases
    for i in range(len(wordsInParagraphs)):
        # List slicing ensures the last word and second-to-last word produce a one- and two-string list,
        # respectively (slicing past the end of a list simply returns the elements up to the last one)
        phrases = wordsInParagraphs[i:i+maxGram]  # e.g. ['lorem', 'ipsum', 'dolor']
        # Loop through the one-, two-, and three-word phrases and check if each phrase is in keywords
        for j in range(len(phrases)):
            # Join list of strings into a complete string, e.g. 'lorem', 'lorem ipsum', and 'lorem ipsum dolor'
            phrase = ' '.join(phrases[0:j+1])
            if phrase in hashedKeywords:
                wordFrequency.setdefault(phrase, 0)
                wordFrequency[phrase] += 1

    print(wordFrequency)

Output:

    {'lorem ipsum': 1, 'sed do eiusmod': 1, 'magna': 1, 'aliqua': 1}

Note: this was written for Python 3. If you are using Python 2 and want to remove punctuation, see this answer.
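The same idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions: the name `count_keywords` and the sample text are hypothetical, keywords are assumed to be lowercase and free of punctuation, and the maximum phrase length is derived from the keywords instead of being fixed at 3. It uses `collections.Counter` in place of the `setdefault` bookkeeping.

```python
import string
from collections import Counter

def count_keywords(text, keywords):
    """Count occurrences of each keyword (one or more words) in text."""
    keyword_set = set(keywords)
    # The longest keyword, in words, bounds the phrase length we need to build
    max_gram = max((len(k.split()) for k in keyword_set), default=0)

    # Strip punctuation and lowercase each whitespace-separated word
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table).lower() for w in text.split()]

    counts = Counter()
    for i in range(len(words)):
        # Build the 1..max_gram word phrases starting at position i
        for n in range(1, max_gram + 1):
            if i + n > len(words):
                break
            phrase = ' '.join(words[i:i + n])
            if phrase in keyword_set:
                counts[phrase] += 1
    return counts

# Hypothetical sample input
text = "New York is big. I love New York, and york is old."
result = count_keywords(text, ['new york', 'york', 'paris'])
print(dict(result))  # {'new york': 2, 'york': 3}
```

Keywords that never occur (such as 'paris' here) are simply absent from the result; indexing a `Counter` with a missing key returns 0.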