Assume we have a dictionary:
items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}
I want to get another dictionary containing the 4 elements with the maximum values. E.g. I expect to get:
subitems = {'e': 24, 'g': 24, 'b': 12, 'f': 10}
What is the most Pythonic and efficient way (in memory consumption and execution speed, e.g. when I have a dict with 1,000,000 elements) to do this? Generators, lambdas, something else?
Answer 0 (score: 4)
`heapq.nlargest` is always the correct answer when the question is "How do I get a small number of maximum values from a huge set of inputs?" By using a heap, it minimizes memory usage and CPU usage better than just about anything else you could do in Python. Example:
import heapq
from operator import itemgetter
items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}
topitems = heapq.nlargest(4, items.items(), key=itemgetter(1))  # Use .iteritems() on Py2
topitemsasdict = dict(topitems)
`sorted` plus slicing the result can win when the number of max items requested is a large percentage of the input, but for huge inputs and small numbers of max items, the memory savings of `heapq.nlargest` will win.
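For comparison, the sorted-and-slice approach looks like this (a minimal sketch using the same `items` dict from the question; note that `sorted` is stable, so `'e'` precedes `'g'` among the tied values):

```python
from operator import itemgetter

items = {'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24}

# Sort all items by value, descending, then keep only the first four
top4 = dict(sorted(items.items(), key=itemgetter(1), reverse=True)[:4])
print(top4)  # {'e': 24, 'g': 24, 'b': 12, 'f': 10}
```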
For the CS theory geeks: `heapq.nlargest`, for an input of size `n`, selecting the `k` max values, requires `O(n log k)` computation and `k` storage. `sorted` followed by slicing requires `O(n log n)` computation and `n` storage. So for 1024 inputs and 4 selected items, the work for `nlargest` is ~1024 * 2 computation with storage required of 4; `sorted` + slicing would be ~1024 * 10 computation with storage of 1024. In practice, Python's TimSort (used in `sorted`) has lower overhead than big-O notation can properly convey, and it usually performs better than the big-O analysis would indicate, which is why, for say selecting the top 200 items out of 1024, `sorted` + slicing can still win. But `nlargest` lacks pathological degradation for huge inputs and outputs: it may be slower on occasion, but it's usually not much slower, whereas `sorted` can be faster, but it can also be much slower.
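A rough way to check these trade-offs on your own data is to time both approaches with `timeit` (a sketch; 100,000 random entries are used here to keep the run quick, and absolute numbers will vary by machine):

```python
import heapq
import random
import timeit
from operator import itemgetter

# A large input of random values (scale this up toward 1,000,000 to match the question)
big = {i: random.random() for i in range(100_000)}

# Time each approach a few times and compare
t_heap = timeit.timeit(
    lambda: heapq.nlargest(4, big.items(), key=itemgetter(1)), number=3)
t_sort = timeit.timeit(
    lambda: sorted(big.items(), key=itemgetter(1), reverse=True)[:4], number=3)
print(f"nlargest: {t_heap:.4f}s  sorted+slice: {t_sort:.4f}s")
```

Both expressions return the same top-4 pairs; only their time and memory behavior differs.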
Answer 1 (score: 1)
Check the source code of the `collections.Counter.most_common()` method; it shows the best solution. And of course, the best way is to use a `Counter()` instead of a plain `{}`.
def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least. If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]
    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
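(That is the Python 2 source; Python 3 uses `self.items()`.) In practice you don't need to copy it: building a `Counter` from the data and calling `most_common(n)` directly returns the top-n pairs, shown here on the question's dict:

```python
from collections import Counter

# A Counter can be built straight from an existing mapping
items = Counter({'a': 7, 'b': 12, 'c': 9, 'd': 0, 'e': 24, 'f': 10, 'g': 24})

# most_common(4) delegates to heapq.nlargest under the hood
subitems = dict(items.most_common(4))
print(subitems)  # {'e': 24, 'g': 24, 'b': 12, 'f': 10}
```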