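I have a very long list of hourly values (10 years), and I want to average column 3 per day, so that each date gets the mean of its column 3 values.
My data looks like this: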
> 1/1/2005,16:00:00,83.3971,-3.8950
> 1/1/2005,17:00:00,0.0000,-3.9146
> 1/1/2005,18:00:00,0.0000,-3.9337
> 1/1/2005,19:00:00,0.0000,-3.9532
> 1/1/2005,20:00:00,0.0000,-3.9727
> 1/1/2005,21:00:00,0.0000,-3.9920
> 1/1/2005,22:00:00,0.0000,-4.0116
> 1/1/2005,23:00:00,0.0000,-4.0311
> 1/2/2005,0:00:00,0.0000,-4.0503
> 1/2/2005,1:00:00,0.0000,-4.0697
> 1/2/2005,2:00:00,0.0000,-4.0891
> 1/2/2005,3:00:00,0.0000,-4.1083
> 1/2/2005,4:00:00,0.0000,-4.1279
> 1/2/2005,5:00:00,0.0000,-4.1472
> 1/2/2005,6:00:00,0.0000,-4.1662
> 1/2/2005,7:00:00,0.0000,-4.1858
> 1/2/2005,8:00:00,0.0000,-4.2053
> 1/2/2005,9:00:00,152.7058,-4.2242
> 1/2/2005,10:00:00,302.6400,-4.2436
> 1/2/2005,11:00:00,405.2218,-4.2630
> 1/2/2005,12:00:00,452.6208,-4.2821
> 1/2/2005,13:00:00,441.4662,-4.3016
> 1/2/2005,14:00:00,372.5459,-4.3208
> 1/2/2005,15:00:00,250.8291,-4.3398
> 1/2/2005,16:00:00,86.6172,-4.3592
> 1/2/2005,17:00:00,0.0000,-4.3785
> 1/2/2005,18:00:00,0.0000,-4.3973
> 1/2/2005,19:00:00,0.0000,-4.4167
>...
> 12/30/2014,23:00:00,0.0000,0.7601
> 12/31/2014,0:00:00,0.0000,0.7601
> 12/31/2014,1:00:00,0.0000,0.7601
> 12/31/2014,2:00:00,0.0000,0.7601
> 12/31/2014,3:00:00,0.0000,0.7601
> 12/31/2014,4:00:00,0.0000,0.7601
> 12/31/2014,5:00:00,0.0000,0.7601
> 12/31/2014,6:00:00,0.0000,0.7601
> 12/31/2014,7:00:00,0.0000,0.7601
> 12/31/2014,8:00:00,0.0000,2.6808
> 12/31/2014,9:00:00,153.8084,1.6338
> 12/31/2014,10:00:00,301.9711,1.3491
> 12/31/2014,11:00:00,402.5888,1.2512
> 12/31/2014,12:00:00,447.9860,1.2191
> 12/31/2014,13:00:00,434.9283,1.2277
...
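This might be a great opportunity to highlight the "Split, Apply, Combine" premise with a simple case?
Perhaps read the csv into pandas, index by datetime objects, then groupby day and take sum divided by count (aka the mean)?
Question: I need the average daily values, starting from the 10-year hourly time series above. To restate, I have an hourly dataset running from January 1, 2005 to December 31, 2014, and I want the average value for each day of the year, computed from the daily averages across all 10 years. You dig?
I have already gone from hourly to daily using: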
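import pandas as pd
# Columns 0 and 1 (date, time) are merged into one 'datetime' index; newer pandas versions deprecate this dict form of parse_dates
df = pd.read_csv('file.csv', parse_dates={'datetime': [0, 1]}, index_col='datetime', header=None, usecols=[0, 1, 2])
day_avgs = df.groupby(pd.Grouper(freq='D')).mean()  # pd.TimeGrouper('D') in old pandas releases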
This does return the average daily values, as shown below:
date
2005-01-01 106.307291
2005-01-02 102.578729
2005-01-03 103.332883
2005-01-04 104.139979
2005-01-05 104.999592
... ...
2014-12-02 108.292092
2014-12-03 107.189729
2014-12-04 106.142721
2014-12-05 105.151696
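However, I am confused about how to group these daily values in "day_avgs" so that each calendar date (there are 10 of each, one per year) forms a group, and then average each group to get a single daily mean, i.e. the mean for that date across the whole 10-year dataset. Capiche?
That is, from the 10 years of daily averages, I want the mean for each day of the year (365 values).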
Answer 0: (score: 0)
#!/usr/bin/env python
from datetime import datetime

import pandas

def same_day(date_string):  # remove year
    return datetime.strptime(date_string, "%m/%d/%Y").strftime('%m-%d')

df = pandas.read_csv('input.csv', index_col=0,
                     usecols=[0, 2], names=['date', 'value'],
                     converters={'date': same_day})
print(df.groupby(level=0).mean())
value
date
01-01 143.991035
01-02 123.232340
12-30 0.000000
12-31 100.981233
It assumes that all hourly values carry the same weight, regardless of year. pandas allows duplicate values in the index.
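To illustrate the weighting remark, here is a minimal sketch with made-up numbers (the series `hourly` and its values are hypothetical, not taken from the question's file). It contrasts pooling every hourly value that shares a month-day, which is what the code above does, with averaging each calendar day first and then averaging those daily means across years. The two only differ when some days have fewer readings than others, as with the partial first day in the sample data.

```python
import pandas as pd

# Hypothetical data: 2005-01-01 has one reading, 2006-01-01 has three.
index = pd.to_datetime([
    "2005-01-01 23:00",
    "2006-01-01 00:00",
    "2006-01-01 01:00",
    "2006-01-01 02:00",
])
hourly = pd.Series([8.0, 0.0, 0.0, 0.0], index=index)

# Equal weight per hourly value: pool all readings with the same month-day.
per_hour = hourly.groupby(hourly.index.strftime("%m-%d")).mean()

# Equal weight per year: average each calendar day, then average across years.
daily = hourly.groupby(hourly.index.normalize()).mean()
per_day = daily.groupby(daily.index.strftime("%m-%d")).mean()

print(per_hour["01-01"])  # 2.0 (8 / 4 readings)
print(per_day["01-01"])   # 4.0 ((8 + 0) / 2 days)
```

With complete 24-hour days in every year, both approaches should give the same numbers. Note that leap years add a `02-29` group that is averaged over fewer years than the others.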
To group the data by date (column 1) and find the mean of column 3:
#!/usr/bin/env python
import pandas

df = pandas.read_csv('input.csv', parse_dates=True, index_col=0,
                     usecols=[0, 2], names=['date', 'value'])
print(df.groupby(level=0).mean())
value
date
2005-01-01 143.991035
2005-01-02 123.232340
[2 rows x 1 columns]
Code using itertools.groupby() produces the same result:
#!/usr/bin/env python
import csv
from collections import OrderedDict
from datetime import datetime
from itertools import groupby
from operator import itemgetter
from pprint import pprint

def groupby_mean(file):
    mean = OrderedDict()
    # itertools.groupby() only merges consecutive rows, so rows for the
    # same date must be adjacent in the file (true for chronological data)
    for day, same_day_rows in groupby(csv.reader(file), key=itemgetter(0)):
        L = [float(row[2]) for row in same_day_rows]
        mean[datetime.strptime(day, '%m/%d/%Y')] = sum(L) / len(L)
    return mean

with open('input.csv') as file:
    pprint(groupby_mean(file))
{datetime.datetime(2005, 1, 1, 0, 0): 143.99103529411764,
datetime.datetime(2005, 1, 2, 0, 0): 123.23234}
math.fsum(L) gives the same result as sum(L) with your input.
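For context on why `math.fsum()` is worth mentioning at all: it tracks intermediate rounding errors, while the built-in `sum()` accumulates them. A small sketch with illustrative values (not from the question's data) where the two differ:

```python
import math

values = [0.1] * 10  # illustrative values, not from the input file

plain = sum(values)        # naive left-to-right float addition
exact = math.fsum(values)  # error-compensated summation

print(plain == 1.0)  # False
print(exact == 1.0)  # True
```

For a few dozen hourly readings per day the difference is usually negligible, which matches the observation that both give the same result here.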