Question

I'm new to Python, and I have a set of values like the following:

(3, '655')
(3, '645')
(3, '641')
(4, '602')
(4, '674')
(4, '620')

These were generated from a CSV file using the following code (Python 2.6):

import csv
import time

with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]

        month = data[0][1]
        avg = data[1]
        monthAvg = month, avg
        print monthAvg

What I'd like to do is get the average of the values for each key:

(3, 647)
(4, 632)

My initial thought was to create a new dictionary.

loop through the original dictionary
    if the key does not exist
        add the key and value to the new dictionary
    else
        sum the value to the existing value in the new dictionary

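I'd also have to keep a count for each key so that I can compute the average. That seems like a lot of work - I'm not sure whether there's a more elegant way to accomplish this.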
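
The plan described above, keeping a running total and a count per key, could be sketched roughly as follows. This is only an illustration of that idea: the names `pairs`, `totals`, `counts`, and `averages` are made up for the example, and `//` is used so the result matches the whole-number averages shown earlier.

```python
# Illustrative sketch of the running-sum-and-count idea; all names are hypothetical.
pairs = [(3, '655'), (3, '645'), (3, '641'),
         (4, '602'), (4, '674'), (4, '620')]

totals = {}  # key -> sum of the values seen so far
counts = {}  # key -> number of values seen so far

for key, value in pairs:
    totals[key] = totals.get(key, 0) + int(value)
    counts[key] = counts.get(key, 0) + 1

# Integer division reproduces the whole-number averages from the question.
averages = dict((key, totals[key] // counts[key]) for key in totals)
print(sorted(averages.items()))
```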

Thanks.

Answer 1

You can use collections.defaultdict to build a dictionary that maps each unique key to a list of its values:

>>> l=[(3, '655'),(3, '645'),(3, '641'),(4, '602'),(4, '674'),(4, '620')]
>>> from collections import defaultdict
>>> d=defaultdict(list)
>>> 
>>> for i,j in l:
...    d[i].append(int(j))
... 
>>> d
defaultdict(<type 'list'>, {3: [655, 645, 641], 4: [602, 674, 620]})

Then use a list comprehension to create the expected pairs:

>>> [(i,sum(j)/len(j)) for i,j in d.items()]
[(3, 647), (4, 632)]
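
Note that the whole-number result above relies on Python 2's integer division. As a rough sketch, assuming you want the same grouping to behave identically on Python 2 and Python 3, it could be wrapped in a small helper that converts the values to floats first. The name `average_by_key` is invented here for illustration and is not a library function.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Group (key, value) pairs by key and return {key: mean of values}.

    Hypothetical helper for illustration; values may be numeric strings.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(float(value))
    # The values are floats, so this is a true division on any Python version.
    return dict((key, sum(vals) / len(vals)) for key, vals in groups.items())


l = [(3, '655'), (3, '645'), (3, '641'), (4, '602'), (4, '674'), (4, '620')]
result = average_by_key(l)
print(sorted(result.items()))
```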

In your code, you could do it like this:

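import csv
import time
from collections import defaultdict

d = defaultdict(list)  # month -> list of values for that month

with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]

        month = data[0][1]
        avg = data[1]
        d[month].append(int(avg))

    # once every row has been read, average the values collected for each month
    print [(i,sum(j)/len(j)) for i,j in d.items()]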

Answer 2

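Use pandas, which is designed for exactly this kind of task, so you can express it in very little code (what you want to do is a one-liner). In addition, when given a large number of values, it will typically be much faster than a pure-Python loop.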

import pandas as pd

a=[(3, '655'),
   (3, '645'),
   (3, '641'),
   (4, '602'),
   (4, '674'),
   (4, '620')]

res = pd.DataFrame(a).astype('float').groupby(0).mean()
print(res)

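This gives a mean of 647.0 for key 3 and 632.0 for key 4 (because of astype('float'), the keys are displayed as 3.0 and 4.0).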

Here is a multi-line version that shows what happens at each step:

df = pd.DataFrame(a)  # construct a structure containing data
df = df.astype('float')  # convert data to float values
grp = df.groupby(0)  # group the values by the value in the first column
df = grp.mean()  # take the mean of each group
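
One side effect of calling astype('float') on the whole frame is that the grouping keys become floats as well. A possible variation, assuming you would rather keep integer keys, converts only the value column. The column names `month` and `value` are chosen here purely for illustration.

```python
import pandas as pd

pairs = [(3, '655'), (3, '645'), (3, '641'),
         (4, '602'), (4, '674'), (4, '620')]

# Column names are illustrative; they are not part of the original data.
df = pd.DataFrame(pairs, columns=['month', 'value'])
df['value'] = df['value'].astype(float)  # convert only the values, keys stay integers

# Group by the key column and take the mean of the values in each group.
means = df.groupby('month')['value'].mean()
print(means)
```

Calling `means.to_dict()` turns the result back into a plain dictionary keyed by month.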

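Also, if you want to work with the CSV file directly, it's even easier, because you don't need to parse the file yourself (I'm using dummy names for the columns I don't know about):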

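import pandas as pd
# names= supplies the column labels; header=None says the file has no header row
df = pd.read_csv('file.csv', names=['col0', 'col1', 'col2', 'date', 'col4', 'data'], header=None)
df['month'] = pd.DatetimeIndex(df['date']).month
# keep only the two columns of interest, then average 'data' per month
df = df[['month', 'data']].groupby('month').mean()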

Answer 3

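Use a dictionary comprehension, where items is the list of tuple pairs (note that dict comprehensions require Python 2.7 or later):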

data = {i:[int(b) for a, b in items if a == i] for i in set(a for a, b in items)}
data = {a:int(float(sum(b))/float(len(b))) for a, b in data.items()} # averages

Answer 4

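import itertools,csv
from dateutil.parser import parse as dparse

def make_tuples(fname='file.csv'):
    with open(fname, 'rb') as csvfile:
        rows = list(csv.reader(csvfile))
        # groupby only merges adjacent rows, so the rows must already be ordered by month
        for month,data in itertools.groupby(rows,lambda x:dparse(x[3]).strftime("%b")):
            data = zip(*data)  # transpose the group's rows into columns
            values = [float(v) for v in data[5]]  # column 5 is read as strings
            yield (month,sum(values)/len(values))

print dict(make_tuples('some_csv.csv'))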

is one way to do it...
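
One thing to keep in mind with this approach: itertools.groupby only groups consecutive items that share a key, so it assumes the input is already ordered by that key. A minimal sketch on the tuples from the question, deliberately shuffled and then sorted before grouping (the variable names are illustrative):

```python
import itertools
from operator import itemgetter

# Same data as in the question, in mixed order to show why sorting matters.
pairs = [(4, '602'), (3, '655'), (4, '674'),
         (3, '645'), (3, '641'), (4, '620')]

by_key = itemgetter(0)
averages = {}
# Sort by key first so that equal keys are adjacent, as groupby requires.
for key, group in itertools.groupby(sorted(pairs, key=by_key), key=by_key):
    values = [float(value) for _, value in group]
    averages[key] = sum(values) / len(values)

print(sorted(averages.items()))
```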

Average the values in a dictionary by key
