Question

I have a list like this, where each row is [YEAR, DAY, VALUE1, VALUE2, VALUE3]:

[[2014, 1, 10, 20, 30],
[2014, 1, 3, 7, 4],
[2014, 2, 14, 43, 5],
[2014, 2, 33, 1, 6],
...
[2013, 1, 34, 54, 3],
[2013, 2, 23, 33, 2],
...]

I need to group by year and day to get something like this:

[[2014, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2014, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....
[2013, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2013, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....]

How can I do this with itertools? I can't use pandas or numpy because my system doesn't support them. Thank you very much for your help.

Answer 1

import itertools
import operator

key = operator.itemgetter(0,1)
my_list.sort(key=key)
for (year, day), records in itertools.groupby(my_list, key):
    print("Records on", year, day, ":")
    for record in records: print(record)
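The loop above only prints each group. As a minimal sketch of how the same grouping could produce the sums and the average asked for in the question, the group can be materialized into a list and aggregated. The sample rows are the ones visible in the question; the names rows, total1, total2, mean3 and result are illustrative only.

```python
import itertools
import operator

# Sample rows from the question: [year, day, value1, value2, value3]
my_list = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

key = operator.itemgetter(0, 1)
my_list.sort(key=key)  # groupby only merges adjacent rows, so sort first

result = []
for (year, day), records in itertools.groupby(my_list, key):
    rows = list(records)  # materialize the group so it can be traversed several times
    total1 = sum(r[2] for r in rows)
    total2 = sum(r[3] for r in rows)
    mean3 = sum(r[4] for r in rows) / len(rows)
    result.append([year, day, total1, total2, mean3])

print(result)
```

Because the list is sorted by (year, day) first, the 2013 rows come out before the 2014 rows.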

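itertools.groupby does not work like SQL's GROUP BY: it groups consecutive elements. This means that if you have an unsorted list, you may get several groups for the same key. So, suppose you want to group a list of integers by parity (even or odd); you might write: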

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
itertools.groupby(L, lambda i:i%2)

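Now, if you come from the SQL world, you might expect this to give you two groups, one for the evens and one for the odds. That would make sense, but it is not what Python does. It considers each element in turn and checks whether it belongs to the same group as the previous element. If so, the element joins that group; otherwise a new group is started.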

So, with the list above, we get:

key: 1
elements: [1]

key: 0
elements: [2]

key: 1
elements: [3]

key: 0
elements: [4]

key: 1
elements: [5, 7]  # see what happened here?

key: 0
elements: [8]

So, if you want grouping that behaves the way it does in SQL, you need to sort the list beforehand by the key (condition) you want to group on:

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
L.sort(key=lambda i: i%2)  # now L looks like this: [2,4,8,1,3,5,7] - the odds and the evens stick together
itertools.groupby(L, lambda i: i%2)  # this gives two groups containing all the elements that belong to each group
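Since groupby returns a lazy iterator, nothing is visible until the groups are consumed. The following sketch materializes the groups for both the unsorted and the sorted list so the difference can be inspected directly. The helper name parity and the variables unsorted_groups and sorted_groups are illustrative, not part of the original answer.

```python
import itertools

def parity(i):
    """Grouping key: 0 for even numbers, 1 for odd numbers."""
    return i % 2

L = [1, 2, 3, 4, 5, 7, 8]  # no 6 in the list

# Without sorting: a new group starts every time the key changes.
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(L, parity)]

# With sorting by the same key: one group per distinct key.
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(L, key=parity), parity)]

print(unsorted_groups)
print(sorted_groups)
```

Python's sort is stable, so within each parity group the elements keep their original relative order.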

Answer 2

I tried to write a short, concise answer and did not quite succeed, but I did get to use a lot of Python's built-in modules:

import itertools
import operator
import functools

I will use functools.reduce to do the sums, but it needs a custom function:

def sum_sum_sum_counter(res, array):
    # Unpack the values of the array
    year, day, val1, val2, val3 = array
    res[0] += val1
    res[1] += val2
    res[2] += val3
    res[3] += 1 # counter
    return res

This function keeps a counter because you want to compute an average, and that is more intuitive than a running-average implementation.

Now the fun part: I group by the first two elements (assuming the list is already sorted by them; otherwise something like lst = sorted(lst, key=operator.itemgetter(0,1)) is needed first):

result = []
for i, values in itertools.groupby(lst, operator.itemgetter(0,1)):
    # Now let's use the reduce function with a start list containing zeros
    calc = functools.reduce(sum_sum_sum_counter, values, [0, 0, 0, 0])
    # Append year, day and the results.
    result.append([i[0], i[1], calc[0], calc[1], calc[2]/calc[3]])

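calc[2]/calc[3] is the average of value3. Remember that the last element of the list passed through reduce is a counter! Dividing the sum by the count gives the average.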

This gives me the result:

[[2014, 1, 13, 27, 17.0],
 [2014, 2, 47, 44, 5.5],
 [2013, 1, 34, 54, 3.0],
 [2013, 2, 23, 33, 2.0]]

using only the values you gave.
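For comparison, here is a rough sketch of the running-average alternative mentioned above, using the incremental update mean += (x - mean) / n. The function name sum_sum_running_mean and the layout of its accumulator list are assumptions made for this illustration. It still needs a counter, which is arguably why the sum-then-divide version is the more intuitive of the two.

```python
import functools
import itertools
import operator

def sum_sum_running_mean(res, array):
    # res holds [sum of value1, sum of value2, running mean of value3, row count]
    year, day, val1, val2, val3 = array
    res[0] += val1
    res[1] += val2
    res[3] += 1                          # counter
    res[2] += (val3 - res[2]) / res[3]   # incremental mean update
    return res

# Rows from the question, already ordered so that equal (year, day) keys are adjacent
lst = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

result = []
for (year, day), values in itertools.groupby(lst, operator.itemgetter(0, 1)):
    total1, total2, mean3, _count = functools.reduce(
        sum_sum_running_mean, values, [0, 0, 0.0, 0])
    result.append([year, day, total1, total2, mean3])

print(result)
```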

Answer 3

On real data, sorting before grouping can become inefficient:

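First, the whole iterator gets consumed, which gives up laziness, one of the main goals of functional programming.
Second, sorting is O(n log n), compared with O(n) for grouping.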

To group by some predicate in an SQL-like yet pythonic way, a simple reduce/accumulate over a collections.defaultdict will do:

from functools import reduce
from collections import defaultdict as DD

def groupby( pred, it ):
  return reduce( lambda d,x: d[ pred(x) ].append(x) or d, it, DD(list) )

Then use it with some predicate function or lambda:

>>> words = 'your code might become less readable using reduce'.split()
>>> groupby( len, words )[4]
['your', 'code', 'less']
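Applied to the data from the question, the same dictionary-based idea might look like the sketch below, which aggregates in a single pass without sorting. The names rows, totals and result are illustrative, and the output order relies on dicts preserving insertion order (Python 3.7 and later).

```python
from collections import defaultdict

# Rows from the question: [year, day, value1, value2, value3]
rows = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

# (year, day) -> [sum of value1, sum of value2, sum of value3, row count]
totals = defaultdict(lambda: [0, 0, 0, 0])
for year, day, v1, v2, v3 in rows:
    acc = totals[(year, day)]
    acc[0] += v1
    acc[1] += v2
    acc[2] += v3
    acc[3] += 1

# Turn the accumulated sums into [year, day, sum1, sum2, avg3] rows
result = [[year, day, s1, s2, s3 / n]
          for (year, day), (s1, s2, s3, n) in totals.items()]

print(result)
```

Since no sort is involved, the rows do not need to arrive grouped; a key that reappears later in the input is simply added to its existing entry.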

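As for laziness: reduce, of course, does not return until it has consumed all of its input. You could use itertools.accumulate instead, always returning the same defaultdict, to consume the input lazily (and process the groups as they change) while using less memory.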
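One possible reading of that suggestion is sketched here. The function name lazy_groupby and the helper step are hypothetical, and the initial keyword of itertools.accumulate requires Python 3.8 or later. Every value the iterator yields is the same defaultdict object, updated in place.

```python
from collections import defaultdict
from itertools import accumulate

def lazy_groupby(pred, it):
    """Yield the same defaultdict after each item has been added to its group."""
    def step(groups, x):
        groups[pred(x)].append(x)
        return groups
    # With initial=..., the first value yielded is the still-empty dict.
    return accumulate(it, step, initial=defaultdict(list))

words = 'your code might become less readable using reduce'.split()
snapshots = lazy_groupby(len, words)

groups = next(snapshots)   # the initial, still-empty defaultdict
next(snapshots)            # processes only the first word
partial = {k: list(v) for k, v in groups.items()}  # copy of the state so far

for _ in snapshots:        # drain the remaining words
    pass

print(partial)
print(groups[4])
```

The caller decides how far to advance the iterator, so the groups can be inspected or acted on before the input is exhausted.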
