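I have a numpy array of shape [t, z, x, y] representing an hourly time series of three-dimensional data. The axes of the array are time, vertical coordinate, horizontal coordinate 1, and horizontal coordinate 2. There is also a t-element list of hourly datetime.datetime timestamps.
I want to calculate the mean of the midday data for each day. The result will be an [nday, Z, X, Y] array.
I'm trying to find a pythonic way to do this. I've written something with a bunch of for loops that works, but it seems slow, inflexible, and verbose.
It seems to me that pandas is not the solution here because my time series data are three-dimensional. I'd be happy to be proven wrong.
I've come up with this, using itertools, to find the midday timestamps and group them by date, and now I'm stuck trying to apply imap to find the means.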
from datetime import datetime
import itertools

import numpy as np
import pandas as pd

# create 72 hours of pseudo-data with 3 vertical levels and a 4 by 4
# horizontal grid.
data = np.zeros((72, 3, 4, 4))
t = pd.date_range(datetime(2008, 7, 1), freq='h', periods=72)
for i in range(data.shape[0]):
    data[i, ...] = i

# find the timestamps that are "midday" in North America. We'll
# define midday as between 15:00 and 23:00 UTC, which is 10:00 EST to
# 15:00 PST.
def is_midday(this_t):
    return (this_t.hour >= 15) and (this_t.hour <= 23)

# group the midday timestamps by date
# (the built-in filter replaces itertools.ifilter, which is Python 2 only)
for dt, grp in itertools.groupby(filter(is_midday, t),
                                 key=lambda x: x.date()):
    print('date ' + str(dt))
    for g in grp:
        print(g)

# find means of mid-day data by date
data_list = np.split(data, data.shape[0])
grps = itertools.groupby(filter(is_midday, t),
                         key=lambda x: x.date())
# how to apply map (itertools.imap in Python 2), or something else, to
# data_list and grps? Or somehow split data along axis 0 according to grps?
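Answer 0 (score: 0)
You can shove any object into a pandas structure. This is generally not recommended, but in this case it might work for you.
Create a Series indexed by time, where each element is a 3-D numpy array: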
In [117]: s = pd.Series([data[i] for i in range(data.shape[0])], index=t)
In [118]: s
Out[118]:
2008-07-01 00:00:00 [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], ...
2008-07-01 01:00:00 [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], ...
2008-07-01 02:00:00 [[[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], ...
2008-07-01 03:00:00 [[[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], ...
2008-07-01 04:00:00 [[[4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0], ...
2008-07-01 05:00:00 [[[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0], ...
2008-07-01 06:00:00 [[[6.0, 6.0, 6.0, 6.0], [6.0, 6.0, 6.0, 6.0], ...
2008-07-01 07:00:00 [[[7.0, 7.0, 7.0, 7.0], [7.0, 7.0, 7.0, 7.0], ...
2008-07-01 08:00:00 [[[8.0, 8.0, 8.0, 8.0], [8.0, 8.0, 8.0, 8.0], ...
2008-07-01 09:00:00 [[[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0], ...
2008-07-01 10:00:00 [[[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0,...
2008-07-01 11:00:00 [[[11.0, 11.0, 11.0, 11.0], [11.0, 11.0, 11.0,...
2008-07-01 12:00:00 [[[12.0, 12.0, 12.0, 12.0], [12.0, 12.0, 12.0,...
2008-07-01 13:00:00 [[[13.0, 13.0, 13.0, 13.0], [13.0, 13.0, 13.0,...
2008-07-01 14:00:00 [[[14.0, 14.0, 14.0, 14.0], [14.0, 14.0, 14.0,...
...
2008-07-03 09:00:00 [[[57.0, 57.0, 57.0, 57.0], [57.0, 57.0, 57.0,...
2008-07-03 10:00:00 [[[58.0, 58.0, 58.0, 58.0], [58.0, 58.0, 58.0,...
2008-07-03 11:00:00 [[[59.0, 59.0, 59.0, 59.0], [59.0, 59.0, 59.0,...
2008-07-03 12:00:00 [[[60.0, 60.0, 60.0, 60.0], [60.0, 60.0, 60.0,...
2008-07-03 13:00:00 [[[61.0, 61.0, 61.0, 61.0], [61.0, 61.0, 61.0,...
2008-07-03 14:00:00 [[[62.0, 62.0, 62.0, 62.0], [62.0, 62.0, 62.0,...
2008-07-03 15:00:00 [[[63.0, 63.0, 63.0, 63.0], [63.0, 63.0, 63.0,...
2008-07-03 16:00:00 [[[64.0, 64.0, 64.0, 64.0], [64.0, 64.0, 64.0,...
2008-07-03 17:00:00 [[[65.0, 65.0, 65.0, 65.0], [65.0, 65.0, 65.0,...
2008-07-03 18:00:00 [[[66.0, 66.0, 66.0, 66.0], [66.0, 66.0, 66.0,...
2008-07-03 19:00:00 [[[67.0, 67.0, 67.0, 67.0], [67.0, 67.0, 67.0,...
2008-07-03 20:00:00 [[[68.0, 68.0, 68.0, 68.0], [68.0, 68.0, 68.0,...
2008-07-03 21:00:00 [[[69.0, 69.0, 69.0, 69.0], [69.0, 69.0, 69.0,...
2008-07-03 22:00:00 [[[70.0, 70.0, 70.0, 70.0], [70.0, 70.0, 70.0,...
2008-07-03 23:00:00 [[[71.0, 71.0, 71.0, 71.0], [71.0, 71.0, 71.0,...
Freq: H, Length: 72
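Define the aggregating function. You need to access values, which returns the inner objects; concatenating coerces them back to an actual numpy array, which is then aggregated (with the mean, in this case).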
In [119]: def f(g,grp):
   .....:     return np.concatenate(grp.values).mean()
   .....:
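To see why this aggregation collapses each group to a single number, here is a small standalone sketch. The names `snapshots`, `joined`, and `stacked` are made up for illustration and are not part of the answer. It compares `np.concatenate`, which the function above uses, with `np.stack`, which may be a better fit if the [z, x, y] shape should survive the averaging.

```python
import numpy as np

# Three hourly "snapshots", each shaped (z, x, y) = (3, 4, 4), filled with 0, 1, 2.
# (Hypothetical stand-in for the per-timestamp arrays held in the Series.)
snapshots = [np.full((3, 4, 4), float(i)) for i in range(3)]

# np.concatenate joins along the existing first axis: three (3, 4, 4) -> (9, 4, 4).
joined = np.concatenate(snapshots)
print(joined.shape)   # (9, 4, 4)
print(joined.mean())  # 1.0, a single scalar; the spatial structure is gone

# np.stack adds a new leading axis: (3, 3, 4, 4). Averaging over that axis
# keeps one value per (z, x, y) cell.
stacked = np.stack(snapshots)
per_cell_mean = stacked.mean(axis=0)
print(stacked.shape)        # (3, 3, 4, 4)
print(per_cell_mean.shape)  # (3, 4, 4)
```

Which of the two is appropriate depends on whether a scalar per group or a full field per group is wanted.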
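Since I'm not sure what your resulting output should look like, just create a time-based grouper manually (this is essentially a resample), but don't do anything with the final results (they are just a list of the aggregated values).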
In [121]: [ f(g,grp) for g, grp in s.groupby(pd.Grouper(freq='D')) ]
Out[121]: [11.5, 35.5, 59.5]
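The values above are scalar means over every hour and every grid cell of each day. If the goal is the [nday, Z, X, Y] array of midday means from the question, the same Series-of-arrays idea can probably be adapted as sketched below. This is an illustration under assumptions rather than part of the original answer: the midday window (15:00 to 23:00 UTC) is taken from the question, grouping is done on `index.date` instead of a `pd.Grouper`, and `np.stack(...).mean(axis=0)` replaces the scalar mean. The names `cells`, `midday`, `daily_means`, and `result` are hypothetical.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Same pseudo-data as the question: shape (t, z, x, y) = (72, 3, 4, 4),
# where every value at hour i equals i.
data = np.arange(72, dtype=float)[:, None, None, None] * np.ones((3, 4, 4))
t = pd.DatetimeIndex([datetime(2008, 7, 1) + timedelta(hours=i) for i in range(72)])

# One 3-D array per timestamp, as in the answer. The object array is filled
# element by element so that each cell holds one (z, x, y) array.
cells = np.empty(len(t), dtype=object)
for i in range(len(t)):
    cells[i] = data[i]
s = pd.Series(cells, index=t)

# Keep only the "midday" hours (15:00 to 23:00 UTC, inclusive).
midday = s[(s.index.hour >= 15) & (s.index.hour <= 23)]

# Group by calendar date; np.stack keeps (z, x, y), so each daily mean is (3, 4, 4).
days = []
daily_means = []
for day, grp in midday.groupby(midday.index.date):
    days.append(day)
    daily_means.append(np.stack(list(grp.values)).mean(axis=0))

result = np.stack(daily_means)
print(result.shape)        # (3, 3, 4, 4), i.e. (nday, z, x, y)
print(result[:, 0, 0, 0])  # [19. 43. 67.]
```

The `days` list keeps the date that belongs to each slice of `result` along the first axis.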
You could get reasonably fancy here and, say, return a pandas object (and possibly concat the results).
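One way to read the suggestion of returning a pandas object is to flatten each [z, x, y] field into columns so that ordinary DataFrame operations apply. The following is a sketch under that assumption, not the answerer's code. The midday window comes from the question, and `df`, `daily`, and `result` are hypothetical names.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Same pseudo-data as the question: shape (t, z, x, y) = (72, 3, 4, 4).
data = np.arange(72, dtype=float)[:, None, None, None] * np.ones((3, 4, 4))
t = pd.DatetimeIndex([datetime(2008, 7, 1) + timedelta(hours=i) for i in range(72)])

nt, nz, nx, ny = data.shape

# Flatten every (z, x, y) cell into its own column: one row per hour.
df = pd.DataFrame(data.reshape(nt, -1), index=t)

# Select the midday rows (15:00 to 23:00 UTC), then average per calendar day.
midday = df[(df.index.hour >= 15) & (df.index.hour <= 23)]
daily = midday.groupby(midday.index.normalize()).mean()

# daily is a DataFrame indexed by date; reshape its values back to (nday, z, x, y).
result = daily.to_numpy().reshape(-1, nz, nx, ny)
print(daily.index.tolist())
print(result.shape)  # (3, 3, 4, 4)
```

Because the reshape uses the default row-major order in both directions, column k of `df` maps back to the same (z, x, y) cell it came from.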