Python: find the item with the most occurrences in a list

Date: 2011-08-08 19:10:18

Tags: python list max counting

In Python, I have a list:

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]  

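I want to identify the item that occurs the most times. I am able to solve it, but I need the fastest way to do so. I know there is a nice Pythonic answer to this.

14 answers:

Answer 0 (score: 91)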

from collections import Counter
most_common,num_most_common = Counter(L).most_common(1)[0] # 4, 6 times

For older Python versions (< 2.7), you can use this recipe to get the Counter class.

Answer 1 (score: 65)

I am surprised no one has mentioned the simplest solution, max() with the key list.count:

max(lst,key=lst.count)

Example:

>>> lst = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
>>> max(lst,key=lst.count)
4

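This works in Python 3 or 2, but note that it only returns the most frequent item, not its frequency. Also, in the case of a draw (i.e. several items tied for most frequent), only a single item is returned.

Although the time complexity of max() with list.count is worse than that of Counter.most_common(1), as PM 2Ring commented, the approach benefits from a fast C implementation. I find it is the fastest for short lists but slower for larger ones (Python 3.6 timings shown in IPython 5.3):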

In [1]: from collections import Counter
   ...: 
   ...: def f1(lst):
   ...:     return max(lst, key = lst.count)
   ...: 
   ...: def f2(lst):
   ...:     return Counter(lst).most_common(1)
   ...: 
   ...: lst0 = [1,2,3,4,3]
   ...: lst1 = lst0[:] * 100
   ...: 

In [2]: %timeit -n 10 f1(lst0)
10 loops, best of 3: 3.32 us per loop

In [3]: %timeit -n 10 f2(lst0)
10 loops, best of 3: 26 us per loop

In [4]: %timeit -n 10 f1(lst1)
10 loops, best of 3: 4.04 ms per loop

In [5]: %timeit -n 10 f2(lst1)
10 loops, best of 3: 75.6 us per loop
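
The IPython session above can also be reproduced outside IPython. The following is a minimal sketch using the standard timeit module; the repeat count of 100 and the output formatting are my own assumptions, and f2 is changed to return only the item so that both functions return the same kind of value. Absolute numbers will differ from machine to machine.

```python
from collections import Counter
from timeit import timeit

def f1(lst):
    # Quadratic: list.count scans the whole list once per element.
    return max(lst, key=lst.count)

def f2(lst):
    # Linear: one pass to build the counts, then pick the largest.
    return Counter(lst).most_common(1)[0][0]

lst0 = [1, 2, 3, 4, 3]
lst1 = lst0 * 100

REPEATS = 100  # arbitrary; raise it for steadier numbers

for name, data in (("short", lst0), ("long", lst1)):
    for func in (f1, f2):
        seconds = timeit(lambda: func(data), number=REPEATS)
        print(f"{func.__name__} on {name} list: "
              f"{seconds / REPEATS * 1e6:.2f} us per call")
```

On a typical machine f1 should win on the short list and lose clearly on the long one, which is the crossover the timings above describe.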

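Answer 2 (score: 26)

In your question, you asked for the fastest way to do it. As has been demonstrated repeatedly, particularly with Python, intuition is not a reliable guide: you need to measure.

Here is a simple test of several different implementations: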

import sys
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter
from timeit import timeit

L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]

def max_occurrences_1a(seq=L):
    "dict iteritems"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_1b(seq=L):
    "dict items"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.items(), key=itemgetter(1))

def max_occurrences_2(seq=L):
    "defaultdict iteritems"
    c = defaultdict(int)
    for item in seq:
        c[item] += 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_3a(seq=L):
    "sort groupby generator expression"
    return max(((k, sum(1 for i in g)) for k, g in groupby(sorted(seq))), key=itemgetter(1))

def max_occurrences_3b(seq=L):
    "sort groupby list comprehension"
    return max([(k, sum(1 for i in g)) for k, g in groupby(sorted(seq))], key=itemgetter(1))

def max_occurrences_4(seq=L):
    "counter"
    return Counter(seq).most_common(1)[0]

versions = [max_occurrences_1a, max_occurrences_1b, max_occurrences_2, max_occurrences_3a, max_occurrences_3b, max_occurrences_4]

print sys.version, "\n"

for vers in versions:
    print vers.__doc__, vers(), timeit(vers, number=20000)

The results on my machine:

2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] 

dict iteritems (4, 6) 0.202214956284
dict items (4, 6) 0.208412885666
defaultdict iteritems (4, 6) 0.221301078796
sort groupby generator expression (4, 6) 0.383440971375
sort groupby list comprehension (4, 6) 0.402786016464
counter (4, 6) 0.564319133759

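So it appears that the Counter solution is not the fastest. And, in this case at least, groupby is faster than Counter. defaultdict is good, but you pay a little for its convenience; it is slightly faster to use a regular dict with get.

What happens if the list is much bigger? Add L *= 10000 to the test above and reduce the repeat count to 200: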

dict iteritems (4, 60000) 10.3451900482
dict items (4, 60000) 10.2988479137
defaultdict iteritems (4, 60000) 5.52838587761
sort groupby generator expression (4, 60000) 11.9538850784
sort groupby list comprehension (4, 60000) 12.1327362061
counter (4, 60000) 14.7495789528

Now defaultdict is the clear winner. So perhaps the cost of the get method and the loss of the in-place add add up (examining the generated code is left as an exercise).
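
One way to take up that exercise is the standard dis module. This is only a sketch: the two helper functions are hypothetical names that isolate the counting loops from the test above, and the bytecode printed by a current Python 3 interpreter will differ from what Python 2.7 generated for the original timings.

```python
import dis
from collections import defaultdict

def count_with_get(seq):
    # Plain dict: every step calls the get method, then stores the result.
    c = {}
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return c

def count_with_defaultdict(seq):
    # defaultdict: in-place add on a subscript, no explicit method call.
    c = defaultdict(int)
    for item in seq:
        c[item] += 1
    return c

for func in (count_with_get, count_with_defaultdict):
    print(func.__name__)
    dis.dis(func)   # prints the bytecode of the function
    print()
```

Comparing the two listings shows the extra attribute lookup and call in the get version of the loop body.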

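But with the modified test data, the number of unique item values did not change, so presumably dict and defaultdict have an advantage there over the other implementations. So what happens if we use the bigger list but substantially increase the number of unique items? Replace the initialization of L with: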

LL = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
L = []
for i in xrange(1,10001):
    L.extend(l * i for l in LL)

dict iteritems (2520, 13) 17.9935798645
dict items (2520, 13) 21.8974409103
defaultdict iteritems (2520, 13) 16.8289561272
sort groupby generator expression (2520, 13) 33.853593111
sort groupby list comprehension (2520, 13) 36.1303369999
counter (2520, 13) 22.626899004

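So now Counter is clearly faster than the groupby solutions, but still slower than the iteritems versions of dict and defaultdict.

The point of these examples is not to produce an optimal solution. The point is that there often is no single optimal general solution. There are other performance criteria as well. Memory requirements will differ substantially among the solutions and, as the input size grows, memory may become the overriding factor in choosing an algorithm.

Bottom line: it all depends, and you need to measure.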
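
The test script above is Python 2 (print statement, iteritems, xrange). For readers who want to repeat the measurement on Python 3, here is a shortened port as a sketch. The function names and the repeat count of 2000 are my own choices, and the iteritems/items distinction disappears because Python 3 only has items().

```python
import sys
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter
from timeit import timeit

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]

def dict_get(seq=L):
    "dict get"
    c = {}
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.items(), key=itemgetter(1))

def defaultdict_int(seq=L):
    "defaultdict"
    c = defaultdict(int)
    for item in seq:
        c[item] += 1
    return max(c.items(), key=itemgetter(1))

def sort_groupby(seq=L):
    "sort groupby"
    # Sorting first makes equal items adjacent so groupby can count them.
    return max(((k, sum(1 for _ in g)) for k, g in groupby(sorted(seq))),
               key=itemgetter(1))

def counter(seq=L):
    "counter"
    return Counter(seq).most_common(1)[0]

versions = [dict_get, defaultdict_int, sort_groupby, counter]

print(sys.version, "\n")
for vers in versions:
    # number=2000 is arbitrary; the original used 20000 for the short list.
    print(vers.__doc__, vers(), timeit(vers, number=2000))
```

Every variant should report (4, 6); the timings, and possibly their ranking, depend on the interpreter version and the machine.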

Answer 3 (score: 14)

Here is a defaultdict solution that works with Python 2.5 and later:

from collections import defaultdict

L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
d = defaultdict(int)
for i in L:
    d[i] += 1
result = max(d.iteritems(), key=lambda x: x[1])
print result
# (4, 6)
# The number 4 occurs 6 times

Note that if L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67], then there are six 4s and six 7s. However, the result will be (4, 6), i.e. six 4s.
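
If every tied item is wanted rather than just one of them, the counts can be scanned for the maximum. A minimal sketch for Python 3, where most_frequent_items is a hypothetical helper name and not part of the answer:

```python
from collections import Counter

def most_frequent_items(seq):
    """Return (items, count): every value tied for the highest count."""
    counts = Counter(seq)
    if not counts:
        return [], 0
    top = max(counts.values())
    # Counter keeps first-insertion order (Python 3.7+), so tied items
    # come out in order of first appearance.
    return [item for item, n in counts.items() if n == top], top

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67]
print(most_frequent_items(L))  # ([4, 7], 6)
```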

Answer 4 (score: 2)

Perhaps the most_common() method.

Answer 5 (score: 1)

Using Python 3.5.2, I got the best results with groupby from the itertools module, with this function:

from itertools import groupby

a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]

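def occurrence():
    occurrence, num_times = 0, 0
    # groupby only merges adjacent equal items; that is enough here because
    # the 4s in `a` are consecutive. For arbitrary input, group sorted(a).
    for key, values in groupby(a, lambda x : x):
        val = len(list(values))
        # compare with the best count so far, not with the previous key
        if val >= num_times:
            occurrence, num_times =  key, val
    return occurrence, num_times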

occurrence, num_times = occurrence()
print("%d occurred %d times which is the highest number of times" % (occurrence, num_times))

Output:

4 occurred 6 times which is the highest number of times

Tested with timeit from the timeit module.

I used this script for my test, with number = 20000:

from itertools import groupby

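def occurrence():
    a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
    occurrence, num_times = 0, 0
    # groupby only merges adjacent equal items (the 4s are consecutive here)
    for key, values in groupby(a, lambda x : x):
        val = len(list(values))
        # compare with the best count so far, not with the previous key
        if val >= num_times:
            occurrence, num_times =  key, val
    return occurrence, num_times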

if __name__ == '__main__':
    from timeit import timeit
    print(timeit("occurrence()", setup = "from __main__ import occurrence",  number = 20000))

Output (the best run):

0.1893607140000313
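
One caveat about this approach: groupby only merges adjacent equal items, so on an unsorted list it measures the longest run, not the overall frequency. The sketch below illustrates the difference; most_common_run and the scattered list are names and data made up for the illustration.

```python
from itertools import groupby

def most_common_run(seq):
    # Returns the key and length of the longest run of adjacent equal items.
    best_key, best_len = None, 0
    for key, group in groupby(seq):
        n = sum(1 for _ in group)
        if n > best_len:
            best_key, best_len = key, n
    return best_key, best_len

scattered = [4, 1, 4, 2, 2, 4]
print(most_common_run(scattered))          # (2, 2): longest run, not the most frequent item
print(most_common_run(sorted(scattered)))  # (4, 3): sorting makes equal items adjacent
```

Sorting costs O(n log n), which is part of why the sort/groupby variants in the earlier benchmark fall behind on large inputs.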

Answer 6 (score: 1)

A simple way without any libraries or sets:

def mcount(l):
  n = []                  #To store count of each elements
  for x in l:
      count = 0
      for i in range(len(l)):
          if x == l[i]:
              count+=1
      n.append(count)
  a = max(n)              #largest in counts list
  for i in range(len(n)):
      if n[i] == a:
          return(l[i],a)  #element,frequency
  return                  #if something goes wrong

Answer 7 (score: 0)

I want to throw in another solution that looks nice and is fast for short lists.

def mc(seq=L):
    "max/count"
    max_element = max(seq, key=seq.count)
    return (max_element, seq.count(max_element))

You can benchmark this with the code provided by Ned Deily, which gives these results for the smallest test case:

3.5.2 (default, Nov  7 2016, 11:31:36) 
[GCC 6.2.1 20160830] 

dict iteritems (4, 6) 0.2069783889998289
dict items (4, 6) 0.20462976200065896
defaultdict iteritems (4, 6) 0.2095775119996688
sort groupby generator expression (4, 6) 0.4473949929997616
sort groupby list comprehension (4, 6) 0.4367636879997008
counter (4, 6) 0.3618192010007988
max/count (4, 6) 0.20328268999946886

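But beware: it is inefficient (list.count is called once per element) and therefore gets really slow for large lists!

Answer 8 (score: 0)

Here is the solution I came up with for the case where several characters in a string all have the highest frequency.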

mystr = input("enter string: ")
#define dictionary to store characters and their frequencies
mydict = {}
#get the unique characters
unique_chars = sorted(set(mystr),key = mystr.index)
#store the characters and their respective frequencies in the dictionary
for c in unique_chars:
    ctr = 0
    for d in mystr:
        if d != " " and d == c:
            ctr = ctr + 1
    mydict[c] = ctr
print(mydict)
#store the maximum frequency
max_freq = max(mydict.values())
print("the highest frequency of occurence: ",max_freq)
#print all characters with highest frequency
print("the characters are:")
for k,v in mydict.items():
    if v == max_freq:
        print(k)

Input: "hello people"

Output:

{'o': 2, 'p': 2, 'h': 1, ' ': 0, 'e': 3, 'l': 3}
the highest frequency of occurence:  3
the characters are:
e
l

Answer 9 (score: 0)

Maybe something like this:

testList = [1, 2, 3, 4, 2, 2, 1, 4, 4]
print(max(set(testList), key = testList.count))

Answer 10 (score: 0)

Simple and best code:

def max_occ(lst,x):
    count=0
    for i in lst:
        if (i==x):
            count=count+1
    return count

lst=[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
x=max(lst,key=lst.count)
print(x,"occurs ",max_occ(lst,x),"times")

Output: 4 occurs  6 times

Answer 11 (score: 0)

My (simple) code (three months of learning Python):

def more_frequent_item(lst):
    new_lst = []
    times = 0
    for item in lst:
        count_num = lst.count(item)
        new_lst.append(count_num)
        times = max(new_lst)
    key = max(lst, key=lst.count)
    print("In the list: ")
    print(lst)
    print("The most frequent item is " + str(key) + ". Appears " + str(times) + " times in this list.")


more_frequent_item([1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67])

The output will be:

In the list: 
[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
The most frequent item is 4. Appears 6 times in this list.

Answer 12 (score: 0)

If you are using Python 3.4 or later, you can use statistics.mode():

>>> import statistics
>>> L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67] 
>>> statistics.mode(L)
4

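Note that it raises statistics.StatisticsError if the list is empty or if there is not exactly one most common value. (From Python 3.8 on, the second case no longer raises: mode() returns the first of the tied values it encounters.)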
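
On Python 3.8 or later, the same module also offers statistics.multimode(), which returns every tied value as a list and an empty list for empty input. A short sketch, reusing the list with six 4s and six 7s from an earlier answer:

```python
import statistics

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67]

# multimode returns all most-common values, in order of first appearance.
print(statistics.multimode(L))   # [4, 7]
print(statistics.multimode([]))  # []

# mode() still raises on empty input.
try:
    statistics.mode([])
except statistics.StatisticsError as exc:
    print("empty input:", exc)
```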

Answer 13 (score: 0)

If you use numpy in your solution for faster computation, use this:

import numpy as np
x = np.array([2,5,77,77,77,77,77,77,77,9,0,3,3,3,3,3])
y = np.bincount(x,minlength = max(x))
y = np.argmax(y)   
print(y)  #outputs 77
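
np.bincount only accepts non-negative integers. For arrays that may contain negative numbers or floats, np.unique with return_counts=True is one alternative; the sketch below assumes numpy is installed and adds a -1 to the sample data purely to show the difference.

```python
import numpy as np

# Same data as above plus a negative value, which np.bincount would reject.
x = np.array([2, 5, 77, 77, 77, 77, 77, 77, 77, 9, 0, 3, 3, 3, 3, 3, -1])

# values is sorted and deduplicated; counts[i] is how often values[i] occurs.
values, counts = np.unique(x, return_counts=True)
most_common = values[np.argmax(counts)]
print(most_common, counts.max())  # 77 7
```

On a tie, np.argmax picks the first maximum, so the smallest of the tied values is returned.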