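I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two adjacent values.
Example: in the following list, the duplicate sequence should be identified and removed.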
a = [45874,
     35195, # <-
     28965,
     5867,
     25847, # <-
     94937,
     64894,
     55535,
     62899,
     391,
     35195, # Duplicate Sequence Start
     28965,
     5867,
     25847, # Duplicate Sequence End
     8483,
     55801,
     33129,
     42616]
I can't wrap my head around any solution, so any help is much appreciated!
Answer 0 (score: 5)
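Finding a single subsequence of length m in a string of length n takes O(nm) time in the worst case with { {3}}. Finding all duplicated subsequences will most likely cost more than that. However, if all you want is to delete the duplicates rather than report them, there is a trick.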
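For comparison with the trick that follows, the single-pattern search mentioned above can be sketched as a naive scan. This is not the algorithm the answer links to, only the simplest baseline with an O(nm) worst case, and `find_run` is a hypothetical name introduced for illustration.

```python
def find_run(values, pattern):
    """Return the first index where pattern occurs in values, or -1.

    Hypothetical helper: a naive scan, not the linked algorithm.
    """
    n, m = len(values), len(pattern)
    for start in range(n - m + 1):  # at most n - m + 1 candidate positions
        # Comparing one window against the pattern costs up to m steps.
        if values[start:start + m] == pattern:
            return start
    return -1


print(find_run([4, 1, 2, 3, 1, 2], [1, 2]))  # 1
```

Running a scan like this for every possible sub-run is what makes the general problem expensive.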
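We only need to look at subsequences of length 2, because any longer sequence can be expressed as overlapping subsequences of length 2. This observation lets us keep a count of these pairs and then remove the repeated ones from right to left.
In particular, this takes a fixed number of passes over the list, so it runs in O(n).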
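The pair decomposition described above can be sketched on a small list. This is only an illustration under assumed names: `adjacent_pairs` and the `sample` values are hypothetical and do not come from the question.

```python
from collections import Counter


def adjacent_pairs(values):
    """Return every overlapping pair of neighbours, left to right."""
    return list(zip(values, values[1:]))


sample = [1, 2, 3, 9, 1, 2, 3]  # the run 1, 2, 3 appears twice
pair_counts = Counter(adjacent_pairs(sample))

# A repeated run of any length shows up as repeated neighbouring pairs.
repeated = {pair for pair, count in pair_counts.items() if count > 1}
print(repeated)
```

Here the repeated run of length 3 is fully described by its two overlapping pairs, (1, 2) and (2, 3), which is why counting pairs is enough to decide what to delete.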
from collections import Counter

def remove_duplicates(lst):
    dropped_indices = set()
    # Count every pair of adjacent values.
    counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))

    # Walk from right to left so the leftmost copy of each pair survives.
    for i in range(len(lst) - 2, -1, -1):
        sub = tuple(lst[i:i+2])
        if counter[sub] > 1:
            dropped_indices |= {i, i + 1}
            counter[sub] -= 1

    return [x for i, x in enumerate(lst) if i not in dropped_indices]
Here is an example based on the provided input.
a = [45874,
     35195, # This
     28965, # Is
     5867,  # A
     25847, # Subsequence
     94937,
     64894,
     55535,
     62899,
     391,
     35195, # Those
     28965, # Should
     5867,  # Be
     25847, # Removed
     8483]
b = remove_duplicates(a)
# Output:
# [45874,
# 35195,
# 28965,
# 5867,
# 25847,
# 94937,
# 64894,
# 55535,
# 62899,
# 391,
# <- Here the duplicate was removed
# 8483]
Answer 1 (score: 0)
I managed it with the following approach, which is probably not very efficient:
sequences = []   # every run of at least two adjacent values, as defined in the question
duplicated = []  # every run that occurs more than once
for length in range(2, len(a)):  # possible run lengths
    for start in range(len(a) - length + 1):  # stop before running off the end
        sequences.append(a[start:start + length])
tests = sequences[:]  # working copy, consumed while checking
for seq in sequences:
    tests.remove(seq)
    if seq in tests:  # another copy is still there, so it is a duplicate
        duplicated.append(seq)
# Handle the longest runs first so a whole duplicate goes before its sub-runs.
for seq in sorted(duplicated, key=len, reverse=True):
    starts = [n for n in range(len(a) - len(seq) + 1) if a[n:n + len(seq)] == seq]
    if len(starts) > 1:  # still duplicated after earlier deletions
        del a[starts[-1]:starts[-1] + len(seq)]  # drop the last copy, keep the first