Assume we have a dictionary:
items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}
I want to get another dictionary containing the 4 elements with the maximum values. E.g. I expect to get:
subitems = {'e': 24, 'g': 24, 'b': 12, 'f': 10}
What is the most Pythonic and efficient way (in memory consumption and execution speed, e.g. when I have a dict with 1,000,000 elements) to do this? Generators, lambdas, something else?
Answer 0 (score: 4)
`heapq.nlargest` is always the correct answer when the question is "How do I get a small number of maximum values from a huge set of inputs?" By using a heap, it minimizes memory usage and CPU usage better than just about anything else you could do in Python. Example:
import heapq
from operator import itemgetter
items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}
topitems = heapq.nlargest(4, items.items(), key=itemgetter(1))  # Use .iteritems() on Py2
topitemsasdict = dict(topitems)
`sorted` plus slicing the result can win when the number of max items requested is a large percentage of the input, but for huge inputs and small numbers of max items, the memory savings of `heapq.nlargest` will win.
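For comparison, the sorted-and-slice approach looks like this (a minimal sketch using the same `items` dict from the question; note that `sorted` is stable, so `'e'` precedes `'g'` among the tied values):

```python
from operator import itemgetter

items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}

# Sort all items by value, descending, then keep only the first four
top4 = dict(sorted(items.items(), key=itemgetter(1), reverse=True)[:4])
print(top4)  # {'e': 24, 'g': 24, 'b': 12, 'f': 10}
```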
For the CS theory geeks: `heapq.nlargest`, for an input of size `n`, selecting the `k` max values, requires `O(n log k)` computation and `k` storage. `sorted` followed by slicing requires `O(n log n)` computation and `n` storage. So for 1024 inputs and 4 selected items, the work for `nlargest` is ~1024 * 2 computation with storage required of 4; `sorted` + slicing would be ~1024 * 10 computation with storage of 1024. In practice, Python's TimSort (used in `sorted`) has lower overhead than big-O notation can properly convey, and it usually performs better than the big-O analysis would indicate, which is why, for say selecting the top 200 items out of 1024, `sorted` + slicing can still win. But `nlargest` lacks pathological degradation for huge inputs and outputs: it may be slower on occasion, but it's usually not much slower, whereas `sorted` can be faster, but it can also be much slower.
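A rough way to check these trade-offs on your own data is to time both approaches with `timeit` (a sketch; 100,000 random entries are used here to keep the run quick, and absolute numbers will vary by machine):

```python
import heapq
import random
import timeit
from operator import itemgetter

# A large input of random values (scale this up toward 1,000,000 to match the question)
big = {i: random.random() for i in range(100_000)}

# Time each approach a few times and compare
t_heap = timeit.timeit(
    lambda: heapq.nlargest(4, big.items(), key=itemgetter(1)), number=3)
t_sort = timeit.timeit(
    lambda: sorted(big.items(), key=itemgetter(1), reverse=True)[:4], number=3)
print(f"nlargest: {t_heap:.4f}s  sorted+slice: {t_sort:.4f}s")
```

Both expressions return the same top-4 pairs; only their time and memory behavior differs.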
Answer 1 (score: 1)
Check the source code of the `collections.Counter.most_common()` method; it shows the best solution. And of course, the best way is to use a `Counter()` instead of a plain `{}`.
def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least. If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]
    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
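(That is the Python 2 source; Python 3 uses `self.items()`.) In practice you don't need to copy it: building a `Counter` from the data and calling `most_common(n)` directly returns the top-n pairs, shown here on the question's dict:

```python
from collections import Counter

# A Counter can be built straight from an existing mapping
items = Counter({'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24})

# most_common(4) delegates to heapq.nlargest under the hood
subitems = dict(items.most_common(4))
print(subitems)  # {'e': 24, 'g': 24, 'b': 12, 'f': 10}
```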