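I have a list of data points containing one measurement every 5 minutes over a 24-hour period. I need to create a new list containing the average value for each hour in that list. What is the best way to do this?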
Date Amount
2015-03-14T00:00:00.000-04:00 12545.869
2015-03-14T00:05:00.000-04:00 12467.326
2015-03-14T00:10:00.000-04:00 12416.948
2015-03-14T00:15:00.000-04:00 12315.698
2015-03-14T00:20:00.000-04:00 12276.38
2015-03-14T00:25:00.000-04:00 12498.696
2015-03-14T00:30:00.000-04:00 12426.145
2015-03-14T00:35:00.000-04:00 12368.659
2015-03-14T00:40:00.000-04:00 12322.785
2015-03-14T00:45:00.000-04:00 12292.719
2015-03-14T00:50:00.000-04:00 12257.965
2015-03-14T00:55:00.000-04:00 12221.375
2015-03-14T01:00:00.000-04:00 12393.725
2015-03-14T01:05:00.000-04:00 12366.674
2015-03-14T01:10:00.000-04:00 12378.578
2015-03-14T01:15:00.000-04:00 12340.754
2015-03-14T01:20:00.000-04:00 12288.511
2015-03-14T01:25:00.000-04:00 12266.136
2015-03-14T01:30:00.000-04:00 12236.639
2015-03-14T01:35:00.000-04:00 12181.668
2015-03-14T01:40:00.000-04:00 12171.992
2015-03-14T01:45:00.000-04:00 12164.298
2015-03-14T01:50:00.000-04:00 12137.282
2015-03-14T01:55:00.000-04:00 12116.486
2015-03-14T02:00:02.000-04:00 12090.439
2015-03-14T02:05:00.000-04:00 12085.924
2015-03-14T02:10:00.000-04:00 12034.78
2015-03-14T02:15:00.000-04:00 12037.367
2015-03-14T02:20:00.000-04:00 12006.649
2015-03-14T02:25:00.000-04:00 11985.588
2015-03-14T02:30:00.000-04:00 11999.41
2015-03-14T02:35:00.000-04:00 11943.121
2015-03-14T02:40:00.000-04:00 11934.346
2015-03-14T02:45:00.000-04:00 11928.568
2015-03-14T02:50:00.000-04:00 11918.63
2015-03-14T02:55:00.000-04:00 11885.698
2015-03-14T03:00:00.000-04:00 11863.065
2015-03-14T03:05:00.000-04:00 11883.256
2015-03-14T03:10:00.000-04:00 11870.095
2015-03-14T03:15:00.000-04:00 11849.104
2015-03-14T03:20:00.000-04:00 11849.18
2015-03-14T03:25:00.000-04:00 11834.229
2015-03-14T03:30:00.000-04:00 11826.603
2015-03-14T03:35:00.000-04:00 11823.516
2015-03-14T03:40:00.000-04:00 11849.386
2015-03-14T03:45:00.000-04:00 11832.385
2015-03-14T03:50:00.000-04:00 11847.059
2015-03-14T03:55:00.000-04:00 11831.807
2015-03-14T04:00:00.000-04:00 11844.027
2015-03-14T04:05:00.000-04:00 11873.114
2015-03-14T04:10:00.000-04:00 11904.105
2015-03-14T04:15:00.000-04:00 11879.018
2015-03-14T04:20:00.000-04:00 11899.658
2015-03-14T04:25:00.000-04:00 11887.808
2015-03-14T04:30:00.000-04:00 11879.875
2015-03-14T04:35:00.000-04:00 11924.149
2015-03-14T04:40:00.000-04:00 11929.499
2015-03-14T04:45:00.000-04:00 11932.086
2015-03-14T04:50:00.000-04:00 11989.847
2015-03-14T04:55:00.000-04:00 12000.971
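Answer 0 (score: 3)
This is a nice use of itertools.groupby, because you can consume the generators it returns directly instead of immediately turning them into lists or something similar: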
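import itertools, pprint

d = {}
# Group consecutive rows by the hour field (characters 11-12 of the timestamp).
for (key, gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    # gen is a lazy iterator over this hour's rows; sum the amounts directly.
    d[key] = sum(v for (_, v) in gen)
pprint.pprint(d)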
For the average instead of the sum:
import itertools, pprint

def avg(gf):
    # Average a non-empty iterable in one pass, without building a list.
    _sum = 0
    for (i, e) in enumerate(gf):
        _sum += e
    return float(_sum) / (i + 1)

d = {}
for (key, gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    #d[key] = sum(v for (_, v) in gen)
    d[key] = avg(v for (_, v) in gen)
pprint.pprint(d)
Output (these values are the hourly sums, i.e. from the sum version):
{0: 148410.565, 1: 147042.743, 2: 143850.52000000002, 3: 142159.685, 4: 142944.15699999998}
The dictionary keys ([0,1,2,3,4]) correspond to the hour of the timestamp.
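One thing worth keeping in mind: itertools.groupby only groups consecutive items with the same key, so the approach above relies on the rows already being in time order. The sketch below illustrates this with a deliberately shuffled sample; `sample`, `hour_of`, and `hourly_sums` are illustrative names that are not part of the original answer.

```python
import itertools

# Hypothetical, deliberately shuffled rows in the same [timestamp, amount] shape as lst.
sample = [
    ['2015-03-14T01:00:00.000-04:00', 10.0],
    ['2015-03-14T00:00:00.000-04:00', 1.0],
    ['2015-03-14T01:05:00.000-04:00', 20.0],
    ['2015-03-14T00:05:00.000-04:00', 3.0],
]

def hour_of(row):
    # Characters 11-12 of the ISO timestamp hold the hour.
    return int(row[0][11:13])

def hourly_sums(rows):
    # Sorting by the grouping key puts each hour's rows next to each other.
    ordered = sorted(rows, key=hour_of)
    return {hour: sum(v for (_, v) in group)
            for hour, group in itertools.groupby(ordered, key=hour_of)}

# Without sorting, each hour shows up as several separate groups.
unsorted_groups = [k for k, _ in itertools.groupby(sample, key=hour_of)]
print(unsorted_groups)      # [1, 0, 1, 0]
print(hourly_sums(sample))  # {0: 4.0, 1: 30.0}
```

If the input is already ordered by time, as in the question, the sort is unnecessary.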
Input:
lst = [ ['2015-03-14T00:00:00.000-04:00', 12545.869 ], ['2015-03-14T00:05:00.000-04:00', 12467.326], ['2015-03-14T00:10:00.000-04:00', 12416.948], ['2015-03-14T00:15:00.000-04:00', 12315.698], ['2015-03-14T00:20:00.000-04:00', 12276.38], ['2015-03-14T00:25:00.000-04:00', 12498.696], ['2015-03-14T00:30:00.000-04:00', 12426.145], ['2015-03-14T00:35:00.000-04:00', 12368.659], ['2015-03-14T00:40:00.000-04:00', 12322.785], ['2015-03-14T00:45:00.000-04:00', 12292.719], ['2015-03-14T00:50:00.000-04:00', 12257.965], ['2015-03-14T00:55:00.000-04:00', 12221.375], ['2015-03-14T01:00:00.000-04:00', 12393.725], ['2015-03-14T01:05:00.000-04:00', 12366.674], ['2015-03-14T01:10:00.000-04:00', 12378.578], ['2015-03-14T01:15:00.000-04:00', 12340.754], ['2015-03-14T01:20:00.000-04:00', 12288.511], ['2015-03-14T01:25:00.000-04:00', 12266.136], ['2015-03-14T01:30:00.000-04:00', 12236.639], ['2015-03-14T01:35:00.000-04:00', 12181.668], ['2015-03-14T01:40:00.000-04:00', 12171.992], ['2015-03-14T01:45:00.000-04:00', 12164.298], ['2015-03-14T01:50:00.000-04:00', 12137.282], ['2015-03-14T01:55:00.000-04:00', 12116.486], ['2015-03-14T02:00:02.000-04:00', 12090.439], ['2015-03-14T02:05:00.000-04:00', 12085.924], ['2015-03-14T02:10:00.000-04:00', 12034.78], ['2015-03-14T02:15:00.000-04:00', 12037.367], ['2015-03-14T02:20:00.000-04:00', 12006.649], ['2015-03-14T02:25:00.000-04:00', 11985.588], ['2015-03-14T02:30:00.000-04:00', 11999.41], ['2015-03-14T02:35:00.000-04:00', 11943.121], ['2015-03-14T02:40:00.000-04:00', 11934.346], ['2015-03-14T02:45:00.000-04:00', 11928.568], ['2015-03-14T02:50:00.000-04:00', 11918.63], ['2015-03-14T02:55:00.000-04:00', 11885.698], ['2015-03-14T03:00:00.000-04:00', 11863.065], ['2015-03-14T03:05:00.000-04:00', 11883.256], ['2015-03-14T03:10:00.000-04:00', 11870.095], ['2015-03-14T03:15:00.000-04:00', 11849.104], ['2015-03-14T03:20:00.000-04:00', 11849.18], ['2015-03-14T03:25:00.000-04:00', 11834.229], ['2015-03-14T03:30:00.000-04:00', 11826.603], 
['2015-03-14T03:35:00.000-04:00', 11823.516], ['2015-03-14T03:40:00.000-04:00', 11849.386], ['2015-03-14T03:45:00.000-04:00', 11832.385], ['2015-03-14T03:50:00.000-04:00', 11847.059], ['2015-03-14T03:55:00.000-04:00', 11831.807], ['2015-03-14T04:00:00.000-04:00', 11844.027], ['2015-03-14T04:05:00.000-04:00', 11873.114], ['2015-03-14T04:10:00.000-04:00', 11904.105], ['2015-03-14T04:15:00.000-04:00', 11879.018], ['2015-03-14T04:20:00.000-04:00', 11899.658], ['2015-03-14T04:25:00.000-04:00', 11887.808], ['2015-03-14T04:30:00.000-04:00', 11879.875], ['2015-03-14T04:35:00.000-04:00', 11924.149], ['2015-03-14T04:40:00.000-04:00', 11929.499], ['2015-03-14T04:45:00.000-04:00', 11932.086], ['2015-03-14T04:50:00.000-04:00', 11989.847], ['2015-03-14T04:55:00.000-04:00', 12000.971], ]
Edit: following the discussion in the comments, how about:
import itertools, pprint

def avg(gf):
    # Average a non-empty iterable in one pass.
    _sum = 0
    for (i, e) in enumerate(gf):
        _sum += e
    return float(_sum) / (i + 1)

d = {}
for (key, gen) in itertools.groupby(lst, key=lambda l: int(l[0][11:13])):
    vals = list(gen)       # Unpack generator
    key = vals[0][0][:13]  # Date plus hour, e.g. '2015-03-14T00'
    d[key] = avg(v for (_, v) in vals)
pprint.pprint(d)
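If slicing the timestamp string feels fragile, roughly the same result can be obtained by parsing the timestamps and truncating them to the hour. This is only a sketch under some assumptions: it needs Python 3.7+ for datetime.fromisoformat, and `sample` and `hourly_averages` are illustrative names that do not appear in the original answer.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows in the same [timestamp, amount] shape as lst.
sample = [
    ['2015-03-14T00:00:00.000-04:00', 10.0],
    ['2015-03-14T00:05:00.000-04:00', 20.0],
    ['2015-03-14T01:00:00.000-04:00', 30.0],
    ['2015-03-15T00:00:00.000-04:00', 50.0],
]

def hourly_averages(rows):
    buckets = defaultdict(list)
    for stamp, amount in rows:
        # Parse the ISO 8601 string, then drop minutes, seconds and microseconds.
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(amount)
    # Key the result by the hour's ISO string, in chronological order.
    return {hour.isoformat(): sum(vals) / len(vals)
            for hour, vals in sorted(buckets.items())}

print(hourly_averages(sample))
```

Because the key includes the date, the same hour on two different days ends up in separate buckets, and the input does not have to be sorted.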
Answer 1 (score: 0)
You can do this easily with a variety of tools, but for simplicity I'll use a plain loop:
with open("listfile.txt", "r") as e:
    list_ = e.read().splitlines()
list_ = list_[1:]  # Grab all but the first (header) line

dateValue = dict()
for row in list_:
    date, value = row.split()
    if ":00:" in date:
        # Start new value
        amount = float(value)

    elif ":55:" in date:
        # End new value: add the last reading of the hour before averaging
        amount += float(value)
        date = date.split(':')[0]  # Grab only date and hour info, e.g. '2015-03-14T00'
        dateValue[date] = amount / 12.  # 12 readings per hour
        del amount  # Just in case the data isn't uniform, so it raises an error

    else:
        amount += float(value)
If you want to export it to lists, do the following:
listDate = list()
listAmount = list()
for k in sorted(dateValue.keys()):
    v = dateValue.get(k)

    listDate.append(k)
    listAmount.append(v)
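The export loop above can also be written more compactly by sorting the dictionary's items once. A small sketch, using a hypothetical `date_value` dictionary in place of the one built from the file:

```python
# Hypothetical stand-in for the dateValue dictionary built above.
date_value = {'2015-03-14T01': 2.0, '2015-03-14T00': 1.0}

# Sort the (key, value) pairs by key, then split them into two parallel lists.
pairs = sorted(date_value.items())
list_date = [k for k, _ in pairs]
list_amount = [v for _, v in pairs]

print(list_date)    # ['2015-03-14T00', '2015-03-14T01']
print(list_amount)  # [1.0, 2.0]
```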
Answer 2 (score: 0)
reads= [
'2015-03-14T00:00:00.000-04:00 12545.869',
'2015-03-14T00:05:00.000-04:00 12467.326',
'2015-03-14T00:10:00.000-04:00 12416.948',
'2015-03-14T00:15:00.000-04:00 12315.698',
'2015-03-14T00:20:00.000-04:00 12276.38',
'2015-03-14T00:25:00.000-04:00 12498.696',
'2015-03-14T00:30:00.000-04:00 12426.145',
'2015-03-14T00:35:00.000-04:00 12368.659',
'2015-03-14T00:40:00.000-04:00 12322.785',
'2015-03-14T00:45:00.000-04:00 12292.719',
'2015-03-14T00:50:00.000-04:00 12257.965',
'2015-03-14T00:55:00.000-04:00 12221.375',
'2015-03-14T01:00:00.000-04:00 12393.725',
'2015-03-14T01:05:00.000-04:00 12366.674',
'2015-03-14T01:10:00.000-04:00 12378.578',
'2015-03-14T01:15:00.000-04:00 12340.754',
'2015-03-14T01:20:00.000-04:00 12288.511',
'2015-03-14T01:25:00.000-04:00 12266.136',
'2015-03-14T01:30:00.000-04:00 12236.639',
'2015-03-14T01:35:00.000-04:00 12181.668',
'2015-03-14T01:40:00.000-04:00 12171.992',
'2015-03-14T01:45:00.000-04:00 12164.298',
'2015-03-14T01:50:00.000-04:00 12137.282',
'2015-03-14T01:55:00.000-04:00 12116.486'
]
sums = {}
for read in reads:
    hour = read.split(':')[0]  # Date plus hour, e.g. '2015-03-14T00'
    value = float(read.split().pop())
    if hour in sums:
        sums[hour] += value
    else:
        sums[hour] = value
avg = {}
for s in sums:
    avg[s] = sums[s] / 12  # Assumes exactly 12 readings per hour
print(avg)
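Dividing by 12 assumes every hour has exactly twelve readings. If some readings might be missing, one option is to count the readings per hour as well. The following is only a sketch; `hourly_means` and `sample` are illustrative names, not part of the original answer.

```python
def hourly_means(reads):
    sums, counts = {}, {}
    for read in reads:
        stamp, value = read.split()
        hour = stamp.split(':')[0]  # Date plus hour, e.g. '2015-03-14T00'
        sums[hour] = sums.get(hour, 0.0) + float(value)
        counts[hour] = counts.get(hour, 0) + 1
    # Divide each sum by the number of readings actually seen for that hour.
    return {hour: sums[hour] / counts[hour] for hour in sums}

# Hypothetical input with an uneven number of readings per hour.
sample = [
    '2015-03-14T00:00:00.000-04:00 10.0',
    '2015-03-14T00:05:00.000-04:00 20.0',
    '2015-03-14T01:00:00.000-04:00 30.0',
]
print(hourly_means(sample))  # {'2015-03-14T00': 15.0, '2015-03-14T01': 30.0}
```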