Question

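Given a set of phrases, I would like to filter out every phrase that contains any of the other phrases. "Contains" here means that if a phrase includes all the words of another phrase, it should be filtered out. The order of the words within a phrase does not matter.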

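What I have so far is this:

Sort the set by the number of words in each phrase.
For each phrase X in the set:
1. For each phrase Y in the rest of the set:
  1. If all the words in X are in Y, then X is contained in Y; discard Y.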
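The steps above can be sketched in Python as follows. This is a minimal sketch, assuming phrases are whitespace-separated strings; the name filter_contained is illustrative, and treating an exact duplicate as "contained" is an assumption.

```python
def filter_contained(phrases):
    """Quadratic baseline: drop every phrase that contains all words of another."""
    # Word order does not matter, so each phrase becomes a set of words.
    # Sorting by word count puts potential sub-phrases first.
    word_sets = sorted((set(p.split()) for p in phrases), key=len)
    discarded = [False] * len(word_sets)

    for i, x in enumerate(word_sets):
        if discarded[i]:
            # Anything containing x also contains the phrase that discarded x.
            continue
        for j in range(i + 1, len(word_sets)):
            # x <= y is the subset test: all words of x are in y.
            # Assumption: an exact duplicate of x is discarded as well.
            if not discarded[j] and x <= word_sets[j]:
                discarded[j] = True

    return [ws for ws, dropped in zip(word_sets, discarded) if not dropped]
```

The nested loop performs up to one subset test per pair of phrases, which is where the quadratic cost comes from.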

This is slow given a list of about 10k phrases. Are there any better options?

Answer 1

You could build an index that maps words to phrases and do something like this:

let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch

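After this, matched will contain all phrases that match every word in the target phrase. It can be optimized by initializing matched to the set of all phrases containing words[0] and then iterating only over words[1..words.length]. Filtering out phrases that are shorter than the target phrase may improve performance, too.

Unless I'm mistaken, a simple implementation has a worst-case complexity (when the search phrase matches all phrases) of O(n·m), where n is the number of words in the search phrase and m is the number of phrases.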
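The lookup described above might look like this in Python. This is only a sketch under stated assumptions: phrases are whitespace-separated strings, the index stores positions in the input list rather than the phrases themselves, and the names build_index and phrases_containing_all are illustrative.

```python
from collections import defaultdict


def build_index(phrases):
    """Map each word to the set of positions of the phrases containing it."""
    index = defaultdict(set)
    for phrase_id, phrase in enumerate(phrases):
        for word in phrase.split():
            index[word].add(phrase_id)
    return index


def phrases_containing_all(index, target):
    """Return positions of all phrases that contain every word of target."""
    words = target.split()
    if not words:
        return set()
    # Initialize from the first word, then intersect with the remaining ones.
    # index.get avoids creating empty entries for unknown words.
    matched = set(index.get(words[0], ()))
    for word in words[1:]:
        matched &= index.get(word, set())
        if not matched:
            break  # no phrase can match once the intersection is empty
    return matched
```

Note that if the target is itself one of the indexed phrases, its own position is part of the result and has to be excluded before discarding the matches.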

Answer 2

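Your algorithm is quadratic in the number of phrases, which is probably what is slowing it down. Here I index phrases by their words to get below quadratic in the common case.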

# build index
foreach phrase: foreach word: phrases[word] += phrase

# use index to filter out phrases that contain all the words
# from another phrase
foreach phrase:
  foreach word:
     if first word:
        siblings = phrases[word]
     else:
        siblings = siblings intersection phrases[word]
  # siblings now contains every phrase that has at least all our words,
  # including the current phrase itself
  remove each sibling except the current phrase from the output set of phrases

# done!

Answer 3

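This is the problem of finding the minimal elements of a set of sets. The naive algorithm, which also serves as the problem definition, looks like this: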

set(s for s in sets if not any(other < s for other in sets))

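There are subquadratic algorithms for this (such as this), but given that N is 10000, the efficiency of the implementation probably matters more. The optimal approach depends heavily on the distribution of the input data. Given that the inputs are natural-language phrases that mostly differ, the approach suggested by redtuna should work well. Here is a Python implementation of that algorithm.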

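from collections import defaultdict
from functools import reduce  # reduce is no longer a builtin in Python 3

def find_minimal_phrases(phrases):
    # Each phrase is an iterable of words; frozenset makes it hashable.
    # Build a list rather than a lazy map object, since it is iterated twice.
    phrases = [frozenset(phrase) for phrase in phrases]

    # Create a map to find all phrases containing a word
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # Visit phrases in order of increasing length to find minimal sets first:
    # a proper superset always has more elements than its subset.
    # Assumes no phrase is empty (reduce raises TypeError on an empty list).
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            # Intersect starting from the smallest set
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            # superphrases includes phrase itself, which also skips duplicates
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases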

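This is still quadratic, but on my machine it runs in 350 ms for a set of 10k phrases, 50% of which are minimal, with words drawn from an exponential distribution.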

Answer 4

Sort each phrase by its contents, i.e. 'Z A' -> 'A Z'; eliminating phrases is then easy, going from the shortest to the longer ones.
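One possible reading of this suggestion, sketched in Python. The answer does not spell out the elimination step, so the subsequence test below is an assumption, as are the whitespace-separated input format and the names canonical, is_subsequence and minimal_phrases.

```python
def canonical(phrase):
    """'Z A' -> ('A', 'Z'): unique words of the phrase in sorted order."""
    return tuple(sorted(set(phrase.split())))


def is_subsequence(short, long):
    """For sorted tuples of unique words this is exactly the subset test."""
    remaining = iter(long)
    # 'word in remaining' advances the iterator until word is found,
    # so each word of short must appear after the previous match.
    return all(word in remaining for word in short)


def minimal_phrases(phrases):
    # The set collapses phrases that differ only in word order.
    forms = sorted({canonical(p) for p in phrases}, key=lambda f: (len(f), f))
    kept = []
    for form in forms:
        # Shorter forms come first, so a contained phrase is already in kept.
        if not any(is_subsequence(k, form) for k in kept):
            kept.append(form)
    return kept
```

This still compares each phrase against every kept phrase, so it remains quadratic in the worst case; canonicalizing only makes the individual comparisons simple and order-independent.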
