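I have a list with many words (100,000+), and I want to remove every word that is a substring of another word in the list.
To keep things simple, assume I have the following list: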
words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
The desired output is:
['Hello', 'Apple', 'Banana', 'Peter']
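'Hell' was removed because it is a substring of 'Hello'.
'Ban' was removed because it is a substring of 'Banana'.
'P' was removed because it is a substring of 'Peter'.
'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.
What I have done
This is my code, but I wonder whether there is a more efficient way than these nested comprehensions.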
to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]
How can I improve the performance? Should I use regex instead?
Answer 0 (score: 2)
Build the set of all (unique) substrings first, then use it to filter the words:
def substrings(s):
    length = len(s)
    return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}

def remove_substrings(words):
    subs = set()
    for word in words:
        subs |= substrings(word)
    return set(w for w in words if w not in subs)
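The function above returns a set, so the original word order is lost. If order matters, the same idea can return a list instead. The following is a minimal sketch under that assumption; the names `proper_substrings` and `remove_substrings_ordered` are hypothetical and not part of the answer. Note that a word of length L contributes up to L(L+1)/2 substrings, so memory use grows quickly with long words.

```python
def proper_substrings(s):
    """Return every substring of s except s itself."""
    n = len(s)
    return {s[i:j] for i in range(n) for j in range(i + 1, n + 1)} - {s}


def remove_substrings_ordered(words):
    """Drop words that are substrings of another word, keeping input order."""
    subs = set()
    for word in set(words):  # duplicates add nothing new, so visit each word once
        subs |= proper_substrings(word)
    return [w for w in words if w not in subs]


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings_ordered(words))
# ['Hello', 'Apple', 'Banana', 'Peter']
```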
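Answer 1 (score: 2)
@wim is correct.
Given an alphabet of fixed size, the following algorithm is linear in the total length of the text. If the alphabet is of unbounded size, it is O(n log(n)) instead. Either way, it beats O(n^2).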
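Create an empty suffix tree T.
Create an empty list filtered_words
Sort words by length, longest first (so every possible superstring is in T before its substrings are tested)
For word in words:
    if word not in T:
        Build suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        append word to filtered_words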
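The steps above can be sketched in Python. This is only an illustration under a simplifying assumption: it uses a naive suffix trie built from nested dicts in place of Ukkonen's algorithm, so construction is quadratic in each word's length rather than linear. The names `remove_substrings_trie` and `trie_contains` are hypothetical.

```python
def trie_contains(trie, word):
    """Return True if word can be read from the root, i.e. it is a
    substring of some word whose suffixes were already inserted."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return True


def remove_substrings_trie(words):
    """Keep only the words that are not substrings of another word."""
    trie = {}   # naive suffix trie: nested dicts keyed by character
    kept = []
    # Deduplicate while keeping order, then go longest first so that any
    # word that could contain `word` is indexed before `word` is tested.
    for word in sorted(dict.fromkeys(words), key=len, reverse=True):
        if trie_contains(trie, word):
            continue
        kept.append(word)
        # Insert every suffix of the kept word (naive, not Ukkonen).
        for start in range(len(word)):
            node = trie
            for ch in word[start:]:
                node = node.setdefault(ch, {})
    return kept


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(remove_substrings_trie(words))
# ['Banana', 'Hello', 'Apple', 'Peter']
```

The result is ordered by decreasing length because of the sort. A real suffix tree or suffix automaton would be needed to reach the linear bound claimed in the answer.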
Answer 2 (score: -1)
You can sort the data by length and then use a list comprehension.
Output:
['Banana', 'Hello', 'Apple', 'Peter']
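The answer's own code is not shown here. The following is a minimal sketch of one way the sort-then-filter idea could look, not necessarily the author's version; the name `filter_by_length` is hypothetical. Each word is compared only against the words that sort before it, but the worst case is still quadratic in the number of words.

```python
def filter_by_length(words):
    """Sort longest first, then keep a word only if no earlier
    (equal-length or longer) word contains it."""
    by_length = sorted(words, key=len, reverse=True)
    return [
        w for i, w in enumerate(by_length)
        if not any(w in longer for longer in by_length[:i])
    ]


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
print(filter_by_length(words))
# ['Banana', 'Hello', 'Apple', 'Peter']
```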
Answer 3 (score: -1)
Note that for loops in Python are generally slow (you could use NumPy arrays or an NLP package). Apart from that, how about this:
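words = list(set(words))  # eliminate duplicates
# Join with a separator instead of str(words), so quotes and commas from
# the list repr cannot match; assumes no word contains a newline.
str_words = '\n'.join(words)
r = []
for x in words:
    # A word that occurs only once in the joined text is not a substring of any other word.
    if str_words.find(x) == str_words.rfind(x):
        r.append(x)
print(r)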
As I answered here, I don't see why C++ wouldn't be an option.