Question

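I have a list containing many words (100,000+), and I want to remove every word that is a substring of another word in the list.

For simplicity, let's assume I have the following list: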

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

The desired output is:

['Hello', 'Apple', 'Banana', 'Peter']

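'Hell' was removed because it is a substring of 'Hello'.
'Ban' was removed because it is a substring of 'Banana'.
'P' was removed because it is a substring of 'Peter'.
'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.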

What I did

This is my code, but I am wondering whether there is a more efficient way than these nested comprehensions.

to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]

How can I improve the performance? Should I use regex instead?

Answer 1

First build the set of all (unique) proper substrings, then use it to filter the words:

def substrings(s):
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}


def remove_substrings(words):
    subs = set()
    for word in words:
        subs |= substrings(word)

    return set(w for w in words if w not in subs)
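As a small worked example of the idea above, the sketch below prints the proper substrings of one short word and then filters the sample list from the question. The list-based filter (the name `kept` is only illustrative) is an assumption of mine, used so that the input order is preserved; the answer's own function returns a set.

```python
def substrings(s):
    # Same construction as in the answer: every proper substring of s.
    n = len(s)
    return {s[i:j + 1] for i in range(n) for j in range(i, n)} - {s}


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

# Union of the proper substrings of every word.
subs = set().union(*(substrings(w) for w in words))

# Filtering into a list (instead of a set) keeps the original order.
kept = [w for w in words if w not in subs]

print(sorted(substrings('Ban')))  # ['B', 'Ba', 'a', 'an', 'n']
print(kept)                       # ['Hello', 'Apple', 'Banana', 'Peter']
```

A word of length L contributes up to L(L+1)/2 - 1 substrings, so the set can grow large when the words are long.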

Answer 2

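@wim is right.

Given a fixed-size alphabet, the following algorithm is linear in the total length of the text. With an unbounded alphabet it becomes O(n log(n)). Either way, it beats O(n^2).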

Sort words by length, longest first.
Create an empty suffix tree T.
Create an empty list filtered_words.
For word in words:
    if word not in T:
        Build suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        append word to filtered_words
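A runnable sketch of the same idea follows. Note the substitution: instead of a suffix tree built with Ukkonen's algorithm, it uses a suffix automaton, which answers the same "is this a substring of the text so far?" query and is shorter to write. It also makes two assumptions that are mine, not the answer's: words are visited longest first (so a word can only be contained in something already indexed), and the separator `sep` never occurs inside a word. The function name is hypothetical.

```python
def remove_substrings_sam(words, sep="\0"):
    """Keep words that are not substrings of another word (longest first)."""
    # Suffix automaton as parallel lists; state 0 is the initial state.
    nxt = [{}]     # outgoing transitions per state
    link = [-1]    # suffix link per state
    length = [0]   # length of the longest string reaching each state
    last = 0       # state representing the whole text indexed so far

    def extend(ch):
        # Standard online construction: append one character to the text.
        nonlocal last
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    def contains(word):
        # A word is a substring of the indexed text iff it can be walked
        # from the initial state.
        state = 0
        for ch in word:
            state = nxt[state].get(ch)
            if state is None:
                return False
        return True

    kept = []
    for word in sorted(words, key=len, reverse=True):
        if not contains(word):
            # The separator stops matches from spanning two words.
            for ch in sep + word:
                extend(ch)
            kept.append(word)
    return kept


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings_sam(words))  # ['Banana', 'Hello', 'Apple', 'Peter']
```

The result comes back ordered by length rather than in input order, and exact duplicates are collapsed to one copy.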

Answer 3

You can sort the data by length (longest first) and then use a list comprehension. For the sample list, the output is:

['Banana', 'Hello', 'Apple', 'Peter']
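A minimal sketch of the sort-then-filter approach described in this answer, assuming the sort is longest first and each word is checked only against the longer words that precede it. The function and variable names are illustrative, not taken from the answer.

```python
def remove_substrings_sorted(words):
    # Longest words first, so any container of a word appears before it.
    by_length = sorted(words, key=len, reverse=True)
    # Keep a word only if no earlier (longer or equal-length) word contains it.
    return [
        w for i, w in enumerate(by_length)
        if not any(w in longer for longer in by_length[:i])
    ]


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings_sorted(words))  # ['Banana', 'Hello', 'Apple', 'Peter']
```

This roughly halves the number of comparisons compared with checking every pair, but it is still quadratic in the number of words in the worst case.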


Answer 4

Note that explicit for loops in Python are generally slow (you could use numpy arrays or an NLP package). That aside, how about this:

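words = list(set(words))  # eliminate duplicates
str_words = str(words)    # one string that contains every word exactly once
r = []
for x in words:
    # A word found at two different positions also occurs inside another word.
    if str_words.find(x) != str_words.rfind(x):
        continue
    r.append(x)
print(r)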
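One caveat with searching `str(words)` is that the list punctuation itself becomes part of the haystack, so a word such as ',' would match the separators between items. A hedged variant of the same find/rfind trick joins the words with a separator instead; it assumes `sep` never occurs inside a word, and the function name is hypothetical.

```python
def remove_substrings_joined(words, sep="\0"):
    # Drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(words))
    joined = sep.join(unique)
    # A word whose first and last positions differ occurs more than once,
    # i.e. it also appears inside some other word.
    return [w for w in unique if joined.find(w) == joined.rfind(w)]


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings_joined(words))  # ['Hello', 'Apple', 'Banana', 'Peter']
```

Each lookup still scans the whole joined string, so the total work grows with the number of words times the total text length.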

As I answered here, I don't see why C++ wouldn't be an option.

Removing substrings from a list with better than O(n^2) complexity
