Question

我必须处理传感器数据（特别是来自ros，但不相关）。为此，我有几个二维numpy数组，其中一行存储时间戳，其后的一行存储相应的传感器数据。问题是，这样的数组没有相同的尺寸（不同的采样时间）。我需要将所有这些数组合并为一个大数组。如何根据时间戳进行操作，例如将丢失的数字替换为0或NaN？

我的情况示例：

import numpy as np

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))
print(a)

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)

b=np.array((time2,data2))

print(b)

哪个返回输出

[[  1   2   3   4   5   6   7   8   9]
 [ 51   9 117 174 164  60  95 197  30]]

[[  1   3   5   7   9]
 [ 35 188 114 153  36]]

我要找的是

[[  1   2   3   4   5   6   7   8   9]
 [ 51   9 117 174 164  60  95 197  30]
 [ 35   0 188   0 114   0 153   0  36]]

有什么办法可以有效地实现这一目标？这是一个示例，但我正在处理数千个示例。谢谢！

Answer 1

对于一个b矩阵的简单情况

在a的第一行存储所有可能的时间戳记的同时，a和b中的第一行均已排序，我们可以使用np.searchsorted-

idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type((a.dtype,b.dtype))
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))

对于多个b矩阵

方法1

要扩展到多个 b矩阵，我们可以在循环内对np.searchsorted使用类似的方法，如下所示-

def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices

    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0],b_i[0])
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out

样品运行-

In [175]: a
Out[175]: 
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])

In [176]: b0
Out[176]: 
array([[16, 22, 34, 56, 67, 91],
       [20, 80, 69, 79, 47, 64],
       [82, 88, 49, 29, 19, 19]])

In [177]: b1
Out[177]: 
array([[ 4, 16, 34, 99],
       [28, 34,  0,  0],
       [36, 53,  5, 38],
       [17, 79,  4, 42]])

In [178]: merge_arrays(a, [b0,b1])
Out[178]: 
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 0,  0, 20, 80, 69, 79, 47,  0, 64,  0],
       [ 0,  0, 82, 88, 49, 29, 19,  0, 19,  0],
       [28,  0, 34,  0,  0,  0,  0,  0,  0,  0],
       [36,  0, 53,  0,  5,  0,  0,  0,  0, 38],
       [17,  0, 79,  0,  4,  0,  0,  0,  0, 42]])

方法2

如果似乎使用np.searchsorted循环是瓶颈，我们可以矢量化该部分-

def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices

    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)

    r0 = [i[0] for i in B]
    r0s = np.concatenate((r0))
    idxs = np.searchsorted(a[0],r0s)

    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0,cols.cumsum()]
    start,stop = sp[:-1],sp[1:]
    for (b_i,s0,s1) in zip(B,start,stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out

Answer 2

这是使用np.searchsorted的一种方法：

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))
# array([[  1,   2,   3,   4,   5,   6,   7,   8,   9],
#        [118, 105,  86,  94,  69,  17, 142,  46,  54]])

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1,  3,  5,  7,  9],
#        [70, 15,  4, 97, 57]])

out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [118., 105.,  86.,  94.,  69.,  17., 142.,  46.,  54.],
       [ 70.,   0.,  15.,   0.,   4.,   0.,  97.,   0.,  57.]])

更新-合并许多矩阵

这是具有多个b矩阵的场景的几乎完全矢量化的方法。这种方法不需要先验知识即可知道最大的列表：

def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix 
    A = x[0][max_len_list]
    out = np.vstack([A[1], np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.in1d(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming 
    # all values are contained in the largest
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out

让我们检查以下示例：

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))

# array([[  1,   2,   3,   4,   5,   6,   7,   8,   9],
#        [107,  13, 123, 119, 137, 135,  65, 157,  83]])

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[  1,   3,   5,   7,   9],
#        [ 81,  49,  83,  32, 179]])

time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[  1,   3],
#        [185, 117]])

merge_timestamps([a,b,c])

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [107.,  13., 123., 119., 137., 135.,  65., 157.,  83.],
       [185.,   0., 117.,   0.,   0.,   0.,   0.,   0.,   0.],
       [ 81.,   0.,  49.,   0.,  83.,   0.,  32.,   0., 179.]])

如前所述，此方法不需要先验知识即可获得最大列表，即它也可以与以下方法一起使用：

merge_timestamps([b, c, a])

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [107.,  13., 123., 119., 137., 135.,  65., 157.,  83.],
       [185.,   0., 117.,   0.,   0.,   0.,   0.,   0.,   0.],
       [ 81.,   0.,  49.,   0.,  83.,   0.,  32.,   0., 179.]])

Answer 3

仅在传感器以固定间隔捕获数据时适用。首先，我们将需要创建一个具有固定间隔（在这种情况下为15分钟间隔）的数据框，然后使用concat函数对该具有传感器数据的数据框进行操作。

代码以15分钟的间隔生成数据帧（已复制）

l = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='15T'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
l = pd.DataFrame(l)

假设以下数据来自传感器

m = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='30T'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)

merge在两个数据框上方

df = l.merge(m, left_on = 0, right_on= 0,how='left')
df.loc[df['SensorData'].isna() == True,'SensorData'] = 0

输出

                       0  SensorData
0   2016-09-02T17:30:00Z         0.0
1   2016-09-02T17:45:00Z         0.0
2   2016-09-02T18:00:00Z         1.0
3   2016-09-02T18:15:00Z         0.0
4   2016-09-02T18:30:00Z         2.0
5   2016-09-02T18:45:00Z         0.0
6   2016-09-02T19:00:00Z         3.0
7   2016-09-02T19:15:00Z         0.0
8   2016-09-02T19:30:00Z         4.0
9   2016-09-02T19:45:00Z         0.0
10  2016-09-02T20:00:00Z         5.0
11  2016-09-02T20:15:00Z         0.0
12  2016-09-02T20:30:00Z         6.0
13  2016-09-02T20:45:00Z         0.0
14  2016-09-02T21:00:00Z         7.0

基于第一行合并多维NumPy数组

3 个答案:

对于一个b矩阵的简单情况

对于多个b矩阵