Question

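I have a set of queries, some of which are only partial versions of the final search string. I need to remove these partial strings from a very long collection of queries. Is there a fast way to do this across millions of such sets?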

t = {u'house prices',
 u'how ',
 u'how man',
 u'how many animals go ex',
 u'how many animals go extinted eac',
 u'how many animals go extinted each ',
 u'how many species go',
 u'how many species go extin',
 u'how many species go extinet each yea',
 u'how many species go extinet each year?'}

I only want to keep:

t = {u'house prices',
 u'how many species go extinet each year?',
 u'how many animals go extinted each '}

Here is the solution from @Alex Hall, edited to catch the final string (by appending a '-+-' sentinel):

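# Print the strings that are not a prefix of their successor in sorted order.
# The '-+-' sentinel gives the last real string a successor to be compared with.
q = sorted(t) + ['-+-']
for i in range(len(q) - 1):
    if not q[i+1].startswith(q[i]):
        print('%d %s' % (i, q[i]))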

Answer 1

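Sort the set to produce a list q, then iterate over it and build a new list of the elements for which not q[i+1].startswith(q[i]). This works because, in sorted order, every string that begins with q[i] sits directly after it, so checking only the next element is enough. The last element has no successor and is always kept. That should be reasonably fast.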
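The sort-and-compare idea can be written as a reusable function. This is a minimal sketch, not the answerer's own code: the name drop_prefixes is made up for illustration, and it handles the last element explicitly instead of appending a sentinel.

```python
def drop_prefixes(strings):
    """Return, in sorted order, the strings that are not a prefix of another one.

    Hypothetical helper name; assumes `strings` is any iterable of str.
    """
    q = sorted(strings)
    kept = []
    # Compare each string with its immediate successor in sorted order.
    for current, following in zip(q, q[1:]):
        if not following.startswith(current):
            kept.append(current)
    # The last string has no successor, so it cannot be a prefix of a later one.
    if q:
        kept.append(q[-1])
    return kept
```

Calling drop_prefixes(t) on the set from the question returns the three strings the asker wants to keep. Sorting dominates the cost, so each set takes O(n log n) string comparisons.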

Answer 2

Edit: Alex Hall's solution is better.

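For each set, create a new trie and insert all of the set's strings into it. In the resulting trie, the leaf nodes represent the strings that are not a prefix of any other string. With a good trie implementation, the expected running time is linear in the sum of the string lengths.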
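For comparison, the trie approach might look like the sketch below. It assumes a plain dict-of-dicts trie with one node per character. The name drop_prefixes_trie is made up for illustration, and a dedicated trie library would likely use far less memory.

```python
def drop_prefixes_trie(strings):
    """Return the strings that are not a prefix of any other input string.

    Hypothetical helper; the trie is a nested dict keyed by character.
    """
    strings = set(strings)
    root = {}
    # Insert every string, creating one child dict per character.
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    kept = []
    # A string is kept when the node it ends on has no children (a leaf).
    for s in strings:
        node = root
        for ch in s:
            node = node[ch]
        if not node:
            kept.append(s)
    return kept
```

The result comes back in no particular order. Both passes touch each character once, which matches the linear bound stated above, though in CPython the per-node dict overhead may make this slower in practice than the sorted version.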

Remove redundant strings based on partial strings
