I often use sorted and groupby to find duplicate items in an iterable. Now I see that this is unreliable:
from itertools import groupby
data = 3 * ('x ', (1,), u'x')
duplicates = [k for k, g in groupby(sorted(data)) if len(list(g)) > 1]
print duplicates
# [] printed - no duplicates found - like 9 unique values
Why the code above fails in Python 2.x is explained here.
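For readers on Python 3, the same data fails in a different way: rather than silently producing a wrong order, sorting raises an error, because str and tuple values cannot be ordered against each other. The snippet below is only an illustrative sketch for Python 3 (the original code is Python 2); the variable name outcome is made up for this example.

```python
# Python 3 sketch: mixed str/tuple data cannot be sorted at all.
data = 3 * ('x ', (1,), 'x')  # in Python 3 both 'x ' and 'x' are plain str

try:
    sorted(data)
    outcome = 'sorted without error'
except TypeError as exc:
    # str and tuple define no ordering between each other in Python 3
    outcome = 'TypeError: {}'.format(exc)

print(outcome)
```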
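What is a reliable, Pythonic way to find duplicates?
I searched SO for similar questions/answers. The best of them is "In Python, how do I take a list and reduce it to a list of duplicates?", but the accepted solution is not Pythonic (it is procedural and multi-line: for ... if ... add ... else ... add ... return result), and the other solutions are either unreliable (they depend on the unfulfilled transitivity of the "<" operator) or slow (O(n*n)).
[Edit] Closed. The accepted answer helped me summarize more general conclusions in my answer below.
I like using builtin types to represent, for example, tree structures. That is why I am now afraid of mixing types.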
Answer 0 (score: 11)
Note: this assumes the entries are hashable.
>>> from collections import Counter
>>> data = 3 * ('x ', (1,), u'x')
>>> [k for k, c in Counter(data).iteritems() if c > 1]
[u'x', 'x ', (1,)]
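The session above is Python 2 (iteritems, the u'' prefix). As a minimal sketch of the same idea on Python 3, assuming every item is hashable, items() replaces iteritems(); the function name find_duplicates is hypothetical and chosen only for this example.

```python
from collections import Counter


def find_duplicates(items):
    """Return the hashable items that occur more than once."""
    counts = Counter(items)
    return [item for item, count in counts.items() if count > 1]


data = 3 * ('x ', (1,), 'x')
print(find_duplicates(data))
```

Because Counter is a dict subclass, values that compare equal and hash equally (such as 1 and 1.0) are counted under one key.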
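Answer 1 (score: 1)
Conclusion:
groupby(sorted(..)) is very good under certain conditions.
Use Counter(map(pickle.dumps, data)) instead of Counter(data) and unpickle the keys at the end, or use groupby(sorted(data, key=pickle.dumps)) if unpickling is undesirable or Python 2.7 is not available.
All other solutions from the other question are currently worse.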
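Note:
I considered pre-grouping items by type, or extending them with their hash for hashable items. This helps in the current case, but it is not a safe solution, because the same problem can arise with the "<" operator inside lists, tuples, etc.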
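The pickle-based variant mentioned in the conclusion can be sketched roughly as follows. This is a Python 3 illustration, not the answerer's exact code: the function name find_duplicates_pickled is hypothetical, and the approach assumes that equal items serialize to identical bytes, which pickle does not guarantee in general (dicts with different insertion order, sets, or objects with shared references may pickle differently).

```python
import pickle
from collections import Counter


def find_duplicates_pickled(items):
    """Count items by their pickled bytes, so unhashable items work too."""
    counts = Counter(pickle.dumps(item) for item in items)
    # Unpickle only the keys that were seen more than once.
    return [pickle.loads(blob) for blob, count in counts.items() if count > 1]


# Lists and dicts are unhashable, so Counter(data) would raise TypeError here.
data = [[1, 2], [1, 2], {'a': 1}, 1, 1.0, True]
print(find_duplicates_pickled(data))
```

A side effect of keying on the serialized form is that 1, 1.0 and True are kept apart, since they pickle to different bytes.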
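Answer 2 (score: 1)
This topic interested me, so I timed the solution above against the accepted solution from the other thread.
The Counter approach in this thread is very elegant; however, the accepted answer in the thread "In Python, how do I take a list and reduce it to a list of duplicates?" seems to be about 2 times faster.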
import random as rn
import timeit
from collections import Counter

a = [rn.randint(0,100000) for i in xrange(10000)]

def counter_way(x):
    return [k for k,v in Counter(x).iteritems() if v > 1]

def accepted_way(x): #accepted answer in the linked thread
    duplicates = set()
    found = set()
    for item in x:
        if item in found:
            duplicates.add(item)
        else:
            found.add(item)
    return duplicates

t1 = timeit.timeit('counter_way(a)', 'from __main__ import counter_way, a', number = 100)
print "counter_way: ", t1
t2 = timeit.timeit('accepted_way(a)','from __main__ import accepted_way, a', number = 100)
print "accepted_way: ", t2
Results:
counter_way: 1.15775845813
accepted_way: 0.531060022992
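I tried this under different specifications and the result was always the same.
Answer 3 (score: 0)
However, both solutions share a flaw: they merge values that have the same hash value and compare equal. Whether that matters depends on whether the values you use can collide in this way. You may think this is a far-fetched remark, but it is not (I was surprised too), because Python hashes some values in a special way. Try: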
from collections import Counter

def counter_way(x):
    return [k for k,v in Counter(x).iteritems() if v > 1]

def accepted_way(x): #accepted answer in the linked thread
    duplicates = set()
    found = set()
    for item in x:
        if item in found:
            duplicates.add(item)
        else:
            found.add(item)
    return duplicates

a = ('x ', (1,), u'x') * 2
print 'The values:', a
print 'Counter way duplicates:', counter_way(a)
print 'Accepted way duplicates:', accepted_way(a)
print '-' * 50

# Now the problematic values.
a = 2 * (0, 1, True, False, 0.0, 1.0)
print 'The values:', a
print 'Counter way duplicates:', counter_way(a)
print 'Accepted way duplicates:', accepted_way(a)
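By definition, 1, 1.0 and True have the same hash value (and compare equal), and likewise 0, 0.0 and False. It prints the following on my console (look at the last two lines; all of the values should actually be reported as duplicates):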
c:\tmp\___python\hynekcer\so10247815>python d.py
The values: ('x ', (1,), u'x', 'x ', (1,), u'x')
Counter way duplicates: [u'x', 'x ', (1,)]
Accepted way duplicates: set([u'x', 'x ', (1,)])
--------------------------------------------------
The values: (0, 1, True, False, 0.0, 1.0, 0, 1, True, False, 0.0, 1.0)
Counter way duplicates: [0, 1]
Accepted way duplicates: set([False, True])
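One way to keep 0, 0.0 and False apart while still using a hash-based count is to key each item on its type together with its value. The following is a minimal Python 3 sketch under that assumption; find_duplicates_typed is a hypothetical name, and the approach only looks at the top-level type, so nested values such as (1,) and (1.0,) would still be merged.

```python
from collections import Counter


def find_duplicates_typed(items):
    """Count hashable items, treating equal values of different types as distinct."""
    counts = Counter((type(item), item) for item in items)
    return [item for (_, item), count in counts.items() if count > 1]


data = 2 * (0, 1, True, False, 0.0, 1.0)
print(find_duplicates_typed(data))
```

With this keying, each of the six values in the problematic tuple is reported once as a duplicate, instead of being collapsed into two groups.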
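Answer 4 (score: 0)
Just because I was curious, here is a solution that distinguishes between 0, False, 0.0, etc. It is based on sorting the sequence with my_cmp, which also takes the type of the items into account. Of course, it is very slow compared with the solutions above, because of the sorting. But compare the results!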
import sys
import timeit
from collections import Counter

def empty(x):
    return

def counter_way(x):
    return [k for k,v in Counter(x).iteritems() if v > 1]

def accepted_way(x): #accepted answer in the linked thread
    duplicates = set()
    found = set()
    for item in x:
        if item in found:
            duplicates.add(item)
        else:
            found.add(item)
    return duplicates

def my_cmp(a, b):
    result = cmp(a, b)
    if result == 0:
        return cmp(id(type(a)), id(type(b)))
    return result

def duplicates_via_sort_with_types(x, my_cmp=my_cmp):
    last = '*** the value that cannot be in the sequence by definition ***'
    duplicates = []
    added = False
    for e in sorted(x, cmp=my_cmp):
        if my_cmp(e, last) == 0:
            ##print 'equal:', repr(e), repr(last), added
            if not added:
                duplicates.append(e)
                ##print 'appended:', e
                added = True
        else:
            ##print 'different:', repr(e), repr(last), added
            last = e
            added = False
    return duplicates

a = [0, 1, True, 'a string', u'a string', False, 0.0, 1.0, 2, 2.0, 1000000, 1000000.0] * 1000
print 'Counter way duplicates:', counter_way(a)
print 'Accepted way duplicates:', accepted_way(a)
print 'Via enhanced sort:', duplicates_via_sort_with_types(a)
print '-' * 50

# ... and the timing
t3 = timeit.timeit('empty(a)','from __main__ import empty, a', number = 100)
print "empty: ", t3
t1 = timeit.timeit('counter_way(a)', 'from __main__ import counter_way, a', number = 100)
print "counter_way: ", t1
t2 = timeit.timeit('accepted_way(a)','from __main__ import accepted_way, a', number = 100)
print "accepted_way: ", t2
t4 = timeit.timeit('duplicates_via_sort_with_types(a)','from __main__ import duplicates_via_sort_with_types, a', number = 100)
print "duplicates_via_sort_with_types: ", t4
It prints the following on my console:
c:\tmp\___python\hynekcer\so10247815>python e.py
Counter way duplicates: [0, 1, 2, 'a string', 1000000]
Accepted way duplicates: set([False, True, 2.0, u'a string', 1000000.0])
Via enhanced sort: [False, 0.0, 0, True, 1.0, 1, 2.0, 2, 1000000.0, 1000000, 'a string', u'a string']
--------------------------------------------------
empty: 2.11195471969e-05
counter_way: 0.76977053612
accepted_way: 0.496547434023
duplicates_via_sort_with_types: 11.2378848197
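The code above relies on cmp and sorted(..., cmp=...), which exist only in Python 2. A rough Python 3 counterpart of the type-aware sort could use a sort key made of the type name and the value, as sketched below. This is an assumption-laden illustration rather than a port of the original: the names duplicates_via_typed_sort and typed_key are hypothetical, items of the same type must be mutually orderable, two different types with the same name would break it, and mixed types nested inside containers are not handled.

```python
from itertools import groupby


def duplicates_via_typed_sort(items):
    """Sort by (type name, value), then report each group with more than one member."""

    def typed_key(item):
        # Grouping by type name first means values of different types
        # are never compared with each other.
        return (type(item).__name__, item)

    ordered = sorted(items, key=typed_key)
    return [key[1] for key, group in groupby(ordered, key=typed_key)
            if len(list(group)) > 1]


a = [0, 1, True, 'a string', False, 0.0, 1.0, 2, 2.0] * 3
print(duplicates_via_typed_sort(a))
```

Unlike the Counter-based approaches, this does not require the items to be hashable, at the cost of an O(n log n) sort.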