numpy: aggregating a 4D array by group

Time: 2014-06-27 19:00:31

Tags: numpy pandas itertools

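I have a numpy array of shape [t, z, x, y] representing an hourly time series of three-dimensional data. The axes of the array are time, vertical coordinate, horizontal coordinate 1, and horizontal coordinate 2. There is also a t-element list of hourly datetime.datetime timestamps.

I want to calculate the daily midday mean for each day. The result will be an [nday, Z, X, Y] array.

I'm trying to find a pythonic way to do this. I've written something with a bunch of for loops that works, but it seems slow, inflexible, and verbose.

It appears to me that pandas is not a solution for me because my time series data are three-dimensional. I'd be happy to be proven wrong.

Using itertools, I've come up with the following to find the midday timestamps and group them by date, and now I'm coming up short trying to apply imap to find the means.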

import itertools
from datetime import datetime

import numpy as np
import pandas as pd

# create 72 hours of pseudo-data with 3 vertical levels and a 4 by 4
# horizontal grid.
data = np.zeros((72, 3, 4, 4))
t = pd.date_range(datetime(2008,7,1), freq='1H', periods=72)
for i in range(data.shape[0]):
    data[i,...] = i

# find the timestamps that are "midday" in North America.  We'll
# define midday as between 15:00 and 23:00 UTC, which is 10:00 EST to
# 15:00 PST.
def is_midday(this_t):
    return ((this_t.hour >= 15) and (this_t.hour <= 23))

# group the midday timestamps by date
# (Python 3: the built-in filter replaces itertools.ifilter)
for dt, grp in itertools.groupby(filter(is_midday, t),
                                 key=lambda x: x.date()):
    print('date ' + str(dt))
    for g in grp:
        print(g)

# find means of mid-day data by date
data_list = np.split(data, data.shape[0])
grps = itertools.groupby(filter(is_midday, t),
                         key=lambda x: x.date())
# how to apply map (itertools.imap in Python 2) or something else to
# data_list and grps?  Or somehow split data along axis 0 according
# to grps?

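1 Answer:

Answer 0: (score: 0)

You can shove pretty much any object into a pandas structure. This is generally not recommended, but in this case it might work for you.

Create a Series indexed by time, where each element is a 3-D numpy array: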

In [117]: s = pd.Series([data[i] for i in range(data.shape[0])],index=t)

In [118]: s
Out[118]: 
2008-07-01 00:00:00    [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], ...
2008-07-01 01:00:00    [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], ...
2008-07-01 02:00:00    [[[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], ...
2008-07-01 03:00:00    [[[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], ...
2008-07-01 04:00:00    [[[4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0], ...
2008-07-01 05:00:00    [[[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0], ...
2008-07-01 06:00:00    [[[6.0, 6.0, 6.0, 6.0], [6.0, 6.0, 6.0, 6.0], ...
2008-07-01 07:00:00    [[[7.0, 7.0, 7.0, 7.0], [7.0, 7.0, 7.0, 7.0], ...
2008-07-01 08:00:00    [[[8.0, 8.0, 8.0, 8.0], [8.0, 8.0, 8.0, 8.0], ...
2008-07-01 09:00:00    [[[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0], ...
2008-07-01 10:00:00    [[[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0,...
2008-07-01 11:00:00    [[[11.0, 11.0, 11.0, 11.0], [11.0, 11.0, 11.0,...
2008-07-01 12:00:00    [[[12.0, 12.0, 12.0, 12.0], [12.0, 12.0, 12.0,...
2008-07-01 13:00:00    [[[13.0, 13.0, 13.0, 13.0], [13.0, 13.0, 13.0,...
2008-07-01 14:00:00    [[[14.0, 14.0, 14.0, 14.0], [14.0, 14.0, 14.0,...
...
2008-07-03 09:00:00    [[[57.0, 57.0, 57.0, 57.0], [57.0, 57.0, 57.0,...
2008-07-03 10:00:00    [[[58.0, 58.0, 58.0, 58.0], [58.0, 58.0, 58.0,...
2008-07-03 11:00:00    [[[59.0, 59.0, 59.0, 59.0], [59.0, 59.0, 59.0,...
2008-07-03 12:00:00    [[[60.0, 60.0, 60.0, 60.0], [60.0, 60.0, 60.0,...
2008-07-03 13:00:00    [[[61.0, 61.0, 61.0, 61.0], [61.0, 61.0, 61.0,...
2008-07-03 14:00:00    [[[62.0, 62.0, 62.0, 62.0], [62.0, 62.0, 62.0,...
2008-07-03 15:00:00    [[[63.0, 63.0, 63.0, 63.0], [63.0, 63.0, 63.0,...
2008-07-03 16:00:00    [[[64.0, 64.0, 64.0, 64.0], [64.0, 64.0, 64.0,...
2008-07-03 17:00:00    [[[65.0, 65.0, 65.0, 65.0], [65.0, 65.0, 65.0,...
2008-07-03 18:00:00    [[[66.0, 66.0, 66.0, 66.0], [66.0, 66.0, 66.0,...
2008-07-03 19:00:00    [[[67.0, 67.0, 67.0, 67.0], [67.0, 67.0, 67.0,...
2008-07-03 20:00:00    [[[68.0, 68.0, 68.0, 68.0], [68.0, 68.0, 68.0,...
2008-07-03 21:00:00    [[[69.0, 69.0, 69.0, 69.0], [69.0, 69.0, 69.0,...
2008-07-03 22:00:00    [[[70.0, 70.0, 70.0, 70.0], [70.0, 70.0, 70.0,...
2008-07-03 23:00:00    [[[71.0, 71.0, 71.0, 71.0], [71.0, 71.0, 71.0,...
Freq: H, Length: 72

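Define the aggregating function. You need to access .values, which returns the inner objects; concatenating coerces them back into an actual numpy array, which is then aggregated (with a mean in this case).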

In [119]: def f(g,grp):
   .....:     return np.concatenate(grp.values).mean()
   .....: 
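
Note that `.mean()` with no `axis` argument collapses each group to a single scalar. If the [Z, X, Y] dimensions asked for in the question should survive, one possible variation is to stack the arrays of a group along a new leading axis and average over that axis only. This is a sketch rather than part of the original answer; `f_keep_axes` is a made-up name, and the data setup simply repeats the question's pseudo-data.

```python
import numpy as np
import pandas as pd

# Same pseudo-data as in the question: 72 hourly fields of shape (3, 4, 4).
data = np.zeros((72, 3, 4, 4))
for i in range(data.shape[0]):
    data[i, ...] = i
t = pd.date_range("2008-07-01", freq="h", periods=72)

# Object Series holding one 3-D array per timestamp, as in the answer.
s = pd.Series([data[i] for i in range(data.shape[0])], index=t)

def f_keep_axes(grp):
    # Hypothetical helper: stack the group's 3-D arrays along a new axis 0,
    # then average over that axis only, leaving an array of shape (Z, X, Y).
    return np.stack(list(grp.values)).mean(axis=0)

daily = [f_keep_axes(grp) for _, grp in s.groupby(pd.Grouper(freq="D"))]
result = np.stack(daily)  # shape (nday, Z, X, Y)
```

Like the answer's `f`, this averages over all 24 hours of each day; it does not yet apply the midday filter.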

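Since I'm not exactly sure what your resulting output should look like, just create a time-based grouper manually (this is essentially a resample), without doing anything further with the final results (it is just a list of the aggregated values).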

In [121]: [ f(g,grp) for g, grp in s.groupby(pd.Grouper(freq='D')) ]
Out[121]: [11.5, 35.5, 59.5]
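
The values above are means over all 24 hours and all grid points of each day. For the midday-only means per grid point that the question asks for, one option that avoids the object Series altogether is to build a boolean mask from the timestamps and average slices of the original array. The following is a sketch under a few assumptions: timestamps are timezone-naive UTC, midday is 15:00 to 23:00 UTC as defined in the question, and `midday_daily_mean` is an illustrative name, not an existing library function.

```python
import numpy as np
import pandas as pd

def midday_daily_mean(data, t, first_hour=15, last_hour=23):
    """Average data[t, z, x, y] over the midday hours of each calendar day.

    Returns (days, means), where means has shape (nday, Z, X, Y).
    """
    t = pd.DatetimeIndex(t)
    # True for timestamps whose hour lies in [first_hour, last_hour].
    midday = np.asarray((t.hour >= first_hour) & (t.hour <= last_hour))
    # Midnight of each timestamp's day, used as the group label.
    dates = np.asarray(t.normalize())
    # Sorted unique days that contain at least one midday timestamp.
    days = np.unique(dates[midday])
    means = np.stack([data[midday & (dates == d)].mean(axis=0) for d in days])
    return days, means

# Usage with the question's pseudo-data.
data = np.zeros((72, 3, 4, 4))
for i in range(data.shape[0]):
    data[i, ...] = i
t = pd.date_range("2008-07-01", freq="h", periods=72)

days, means = midday_daily_mean(data, t)
print(means.shape)        # (3, 3, 4, 4)
print(means[:, 0, 0, 0])  # [19. 43. 67.]
```

The loop runs once per day rather than once per hour, so the heavy lifting stays inside numpy's vectorized `mean`.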

You can get reasonably fancy here and, say, return a pandas object (and potentially concat the results).
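
If a pandas object is the desired result, another possibility is to flatten the three spatial axes into columns so that each timestamp becomes one row of an ordinary DataFrame, group by day, and reshape back at the end. This is only a sketch of that idea, not the answer author's method; it assumes the question's pseudo-data and its 15:00 to 23:00 UTC definition of midday.

```python
import numpy as np
import pandas as pd

# Same pseudo-data as in the question.
data = np.zeros((72, 3, 4, 4))
for i in range(data.shape[0]):
    data[i, ...] = i
t = pd.date_range("2008-07-01", freq="h", periods=72)

# One row per timestamp, one column per (z, x, y) grid point.
df = pd.DataFrame(data.reshape(data.shape[0], -1), index=t)

# Keep the midday rows, then average per calendar day.
midday = df[(df.index.hour >= 15) & (df.index.hour <= 23)]
daily = midday.groupby(midday.index.normalize()).mean()

# daily is a DataFrame indexed by day; restore the spatial axes if needed.
result = daily.to_numpy().reshape((len(daily),) + data.shape[1:])
```

Here `daily` keeps the dates as its index, while `result` is the plain [nday, Z, X, Y] array.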