Question

Problem

Given a list of strings, find the strings from the list that appear in a given text.

Example

list = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
result = ['red', 'how are you', 'hello']

'red' is included because the text contains 'shared', which has 'red' as a substring.
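As a baseline for the example above, here is a minimal brute-force sketch. The function name `find_words_naive` is made up for illustration. It scans the text once per word, so its cost grows with the size of the list, which is what the rest of this question is trying to avoid.

```python
def find_words_naive(words, text):
    """Return every word in `words` that occurs somewhere in `text`."""
    # `in` on strings is a substring test, so 'red' matches inside 'shared'.
    return [word for word in words if word in text]


words = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
print(find_words_naive(words, text))  # ['red', 'hello', 'how are you']
```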

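This is very similar to this question, except that the words we need to look for can also be substrings.
The list is pretty large and grows as users are added, whereas the text stays about the same length throughout.
I was thinking of a solution whose time complexity depends on the length of the text rather than on the list of words, so that it scales even when a lot of users are added.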

Solution

I built a trie from the given list of words.
Run a DFS on the text and check the current word against the trie.

Pseudocode

def FindWord(trie, text, word_so_far, index):
    if index > len(text):
        return
    // Check if word_so_far is a prefix of a key; if not, return
    if trie.has_subtrie(word_so_far) == false:
        return
    // Check if word_so_far is a key; if yes, add to result and look further
    if trie.has_key(word_so_far) == true:
        // Add to result and continue
    // Nothing left to extend with once the text is exhausted
    if index == len(text):
        return
    // Extend the current word we are searching
    FindWord(trie, text, word_so_far + text[index], index + 1)
    // Start a new word from the next index
    FindWord(trie, text, "", index + 1)

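The problem is that, although the runtime now depends on len(text), the search runs in O(2^n) time once the trie is built. Building the trie is a one-time cost shared across multiple texts, so that part is fine.

I also don't see any overlapping subproblems that I could memoize to improve the runtime.

Can you suggest a way to get a runtime that depends on the given text rather than on the list of words (which can be preprocessed and cached), and that is also faster than this?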

Answer 1

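The theoretically sound version of what you're trying to do is called Aho-Corasick. Implementing the suffix links is somewhat complicated, IIRC, so here's an algorithm that uses just the trie.

We consume the text letter by letter. At all times, we maintain a set of trie nodes where the traversal could currently be. Initially, this set consists of just the root node. For each letter, we loop through the nodes in the set and descend via the new letter where possible. If the resulting node is a match, report it. Either way, put it in the next set. The next set also contains the root node, since we can start a new match at any time.

Here's my attempt at a quick implementation in Python (untested, no warranty, etc.).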

class Trie:
    def __init__(self):
        self.is_needle = False
        self._children = {}

    def find(self, text):
        node = self
        for c in text:
            node = node._children.get(c)
            if node is None:
                break
        return node

    def insert(self, needle):
        node = self
        for c in needle:
            node = node._children.setdefault(c, Trie())
        node.is_needle = True


def count_matches(needles, text):
    root = Trie()
    for needle in needles:
        root.insert(needle)
    nodes = [root]
    count = 0
    for c in text:
        next_nodes = [root]
        for node in nodes:
            next_node = node.find(c)
            if next_node is not None:
                count += next_node.is_needle
                next_nodes.append(next_node)
        nodes = next_nodes
    return count


print(
    count_matches(['red', 'hello', 'how are you', 'hey', 'deployed'],
                  'hello, This is shared right? how are you doing tonight'))
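For comparison, the suffix links mentioned at the top of this answer could be sketched in pure Python roughly as follows. This is an illustrative sketch, not a tuned implementation: the names `build_automaton` and `find_needles` are made up here, needles are assumed to be non-empty, and it returns the set of matched needles instead of a count.

```python
from collections import deque


def build_automaton(needles):
    """Build the trie plus suffix (failure) links; assumes non-empty needles."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # needles that end at this state, including via suffix links
    for needle in needles:
        state = 0
        for c in needle:
            if c not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        out[state].append(needle)
    # Breadth-first pass: root's children keep fail = 0; deeper states derive
    # their suffix link from their parent's link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for c, child in goto[state].items():
            f = fail[state]
            while f and c not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(c, 0)
            # Inherit matches that end at the suffix-link target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_needles(needles, text):
    """Return the set of needles that occur in text as substrings."""
    goto, fail, out = build_automaton(needles)
    found = set()
    state = 0
    for c in text:
        # Follow suffix links until some state can consume c (or we hit root).
        while state and c not in goto[state]:
            state = fail[state]
        state = goto[state].get(c, 0)
        found.update(out[state])
    return found


print(find_needles(['red', 'hello', 'how are you', 'hey', 'deployed'],
                   'hello, This is shared right? how are you doing tonight'))
```

Unlike the node-set approach, this keeps a single current state per text character; the suffix links stand in for the other partial matches that the set would otherwise track.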

Answer 2

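If you want faster code whose cost depends on the windows of the text, you can use set lookups to speed things up. If feasible, change the lookup list to a set, then enumerate all the windows in the text that could be used for the lookup.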
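One way to read this idea is sketched below. It is an interpretation, not a definitive implementation: the function name `find_words_by_windows` is made up, and it assumes that only windows whose length equals the length of some word in the list need to be checked.

```python
def find_words_by_windows(words, text):
    """Return the set of words that occur in text, using set lookups."""
    lookup = set(words)
    # Only window sizes that match some word's length can produce a hit.
    lengths = {len(word) for word in lookup}
    found = set()
    for length in lengths:
        # Slide a window of this length across the text.
        for start in range(len(text) - length + 1):
            window = text[start:start + length]
            if window in lookup:
                found.add(window)
    return found


print(find_words_by_windows(
    ['red', 'hello', 'how are you', 'hey', 'deployed'],
    'hello, This is shared right? how are you doing tonight'))
```

In this sketch the number of windows is the number of distinct word lengths times the text length, so the work does not grow with the list as long as new words reuse existing lengths.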


Answer 3

Hmm, what about this? Simple and easy to understand :D

import re

def WordsInText(text, words):
    found = []
    for item in words:
        # re.escape stops regex metacharacters in a word (e.g. '?' or '.')
        # from being interpreted as pattern syntax.
        if re.search(re.escape(item), text, re.IGNORECASE) is not None:
            found.append(item)
    return found

Answer 4

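Expanding on @David Eisenstat's suggestion to use the Aho-Corasick algorithm for this: I found a simple Python module (pyahocorasick) that does it.

Here is the code for the example given in the question.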

import ahocorasick

def find_words(list_words, text):
    A = ahocorasick.Automaton()

    for key in list_words:
        A.add_word(key, key)

    A.make_automaton()

    result = []
    for end_index, original_value in A.iter(text):
        result.append(original_value)

    return result

list_words = ['red', 'hello', 'how are you', 'hey', 'deployed']
text = 'hello, This is shared right? how are you doing tonight'
print(find_words(list_words, text))

