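I have to process sensor data (specifically from ROS, but that is not relevant). For this I have several 2D numpy arrays in which one row stores the timestamps and the following row stores the corresponding sensor data. The problem is that these arrays do not have the same dimensions (different sampling times). I need to merge all of these arrays into one big array. How can I do this based on the timestamps, e.g. replacing the missing values with 0 or NaN?
An example of my situation: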
import numpy as np
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
print(a)
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
print(b)
which prints:
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]]
[[ 1 3 5 7 9]
[ 35 188 114 153 36]]
What I am looking for is:
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]
[ 35 0 188 0 114 0 153 0 36]]
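Is there an efficient way to achieve this? This is just one example, but I am working with thousands of samples. Thanks!
Answer 0 (score: 1)
Since the first row of a stores all possible timestamps, and the first rows of both a and b are sorted, we can use np.searchsorted: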
idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))
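The question also mentions NaN as a possible filler. A minimal sketch of that variant, assuming the same layout as the question (row 0 holds timestamps, row 1 holds data) and that every timestamp of b occurs in a; the names b_row and out_nan are illustrative only. NaN requires a floating-point dtype, so the merged array becomes float.

```python
import numpy as np

# Example inputs shaped like the question's: row 0 = timestamps, row 1 = data.
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# Column of a at which each timestamp of b sits
# (assumes a[0] is sorted and contains every value of b[0]).
idx = np.searchsorted(a[0], b[0])

# Start from a float row filled with NaN, then drop b's data into place.
b_row = np.full(a.shape[1], np.nan)
b_row[idx] = b[1]

# vstack promotes the integer rows of a to float to match b_row.
out_nan = np.vstack((a, b_row))
print(out_nan)
```

If integer output matters more than marking gaps explicitly, the zero-filled version keeps the original dtype.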
Approach #1
To extend this to multiple b matrices, we can use np.searchsorted inside a loop in a similar way:
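def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    # Output rows: every row of a plus the data rows (all but the first) of each b
    L = (lens-1).sum() + len(a)
    # Include a's dtype so its values are not truncated on assignment
    out_dtype = np.result_type(a.dtype, *[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        # Columns of a that match the timestamps of this b-matrix
        idx = np.searchsorted(a[0],b_i[0])
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out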
Sample run:
In [175]: a
Out[175]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
In [176]: b0
Out[176]:
array([[16, 22, 34, 56, 67, 91],
[20, 80, 69, 79, 47, 64],
[82, 88, 49, 29, 19, 19]])
In [177]: b1
Out[177]:
array([[ 4, 16, 34, 99],
[28, 34, 0, 0],
[36, 53, 5, 38],
[17, 79, 4, 42]])
In [178]: merge_arrays(a, [b0,b1])
Out[178]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 0, 20, 80, 69, 79, 47, 0, 64, 0],
[ 0, 0, 82, 88, 49, 29, 19, 0, 19, 0],
[28, 0, 34, 0, 0, 0, 0, 0, 0, 0],
[36, 0, 53, 0, 5, 0, 0, 0, 0, 38],
[17, 0, 79, 0, 4, 0, 0, 0, 0, 42]])
Approach #2
If calling np.searchsorted inside the loop turns out to be the bottleneck, we can vectorize that part:
def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    # Include a's dtype so its values are not truncated on assignment
    out_dtype = np.result_type(a.dtype, *[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    # Look up all timestamps of all b-matrices with a single searchsorted call
    r0 = [i[0] for i in B]
    r0s = np.concatenate(r0)
    idxs = np.searchsorted(a[0],r0s)
    # Start/stop offsets of each b-matrix inside the concatenated index array
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0,cols.cumsum()]
    start,stop = sp[:-1],sp[1:]
    for (b_i,s0,s1) in zip(B,start,stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out
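Both approaches assume that the first row of a already holds every timestamp. When no single array does, one possible extension is to build the common time axis first with np.unique and then reuse the same np.searchsorted placement. The sketch below is only an illustration: merge_on_union is a hypothetical helper, not part of the answer, and it assumes each array has sorted, non-repeating timestamps in row 0 and fills gaps with NaN.

```python
import numpy as np

def merge_on_union(arrays, fill=np.nan):
    # Hypothetical helper: each array has timestamps in row 0
    # (sorted, no repeats) and one or more data rows below it.
    t = np.unique(np.concatenate([arr[0] for arr in arrays]))  # sorted union
    n_rows = 1 + sum(len(arr) - 1 for arr in arrays)
    out = np.full((n_rows, t.size), fill, dtype=float)
    out[0] = t
    r = 1
    for arr in arrays:
        # t contains every value of arr[0], so these are exact positions
        idx = np.searchsorted(t, arr[0])
        out[r:r + len(arr) - 1, idx] = arr[1:]
        r += len(arr) - 1
    return out

x = np.array([[1, 2, 4], [10, 20, 40]])
y = np.array([[2, 3, 5], [7, 8, 9]])
merged = merge_on_union([x, y])
print(merged)
```

The float output is a consequence of using NaN as the filler; passing fill=0 keeps the values numeric but still returns floats in this sketch.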
Answer 1 (score: 1)
Here is one approach using np.searchsorted:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [118, 105, 86, 94, 69, 17, 142, 46, 54]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [70, 15, 4, 97, 57]])
out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[118., 105., 86., 94., 69., 17., 142., 46., 54.],
[ 70., 0., 15., 0., 4., 0., 97., 0., 57.]])
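The result above is floating point because np.zeros defaults to float64, so np.vstack promotes the integer rows. If integer output is preferred, one option is to give the filler row an explicit dtype. This is a small sketch with made-up input values; filler and out_int are illustrative names.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [10, 20, 30, 40, 50]])
b = np.array([[1, 3, 5],
              [7, 8, 9]])

# Build the filler row with a dtype that can hold both inputs,
# so vstack does not promote everything to float.
filler = np.zeros(a.shape[1], dtype=np.result_type(a, b))
out_int = np.vstack([a, filler])

# Place b's data at the columns where its timestamps occur in a.
out_int[-1, np.searchsorted(a[0], b[0])] = b[1]
print(out_int)
```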
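Update: merging many matrices
Here is an almost fully vectorized approach for the scenario with multiple b matrices. It does not require knowing in advance which array is the largest: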
def merge_timestamps(*x):
    # Called with a single list of arrays, e.g. merge_timestamps([a, b, c])
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix: both rows of the largest array
    # (timestamps and data), followed by one zero row per remaining array
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.isin(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest, and that timestamps
    # are consecutive integers so the offset from the minimum is the column
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out
Let's check it with the following example:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [107, 13, 123, 119, 137, 135, 65, 157, 83]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [ 81, 49, 83, 32, 179]])
time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[ 1, 3],
# [185, 117]])
merge_timestamps([a,b,c])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
As mentioned, this approach does not need to know in advance which array is the largest, i.e. it also works with:
merge_timestamps([b, c, a])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
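Answer 2 (score: 0)
This only applies when the sensors capture data at fixed intervals.
First, we need to create a dataframe with a fixed interval (15 minutes in this case), and then combine that dataframe with the one holding the sensor data using the merge function.
Code to generate a dataframe at 15-minute intervals (copied):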
import numpy as np
import pandas as pd

# Regular 15-minute grid of timestamps, formatted as strings
l = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='15min'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
l = pd.DataFrame(l)
Suppose the following data comes from the sensor:
# Sensor samples every 30 minutes over the same time range
m = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='30min'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)
Merge the two dataframes above with merge:
df = l.merge(m, left_on = 0, right_on= 0,how='left')
df.loc[df['SensorData'].isna() == True,'SensorData'] = 0
Output:
0 SensorData
0 2016-09-02T17:30:00Z 0.0
1 2016-09-02T17:45:00Z 0.0
2 2016-09-02T18:00:00Z 1.0
3 2016-09-02T18:15:00Z 0.0
4 2016-09-02T18:30:00Z 2.0
5 2016-09-02T18:45:00Z 0.0
6 2016-09-02T19:00:00Z 3.0
7 2016-09-02T19:15:00Z 0.0
8 2016-09-02T19:30:00Z 4.0
9 2016-09-02T19:45:00Z 0.0
10 2016-09-02T20:00:00Z 5.0
11 2016-09-02T20:15:00Z 0.0
12 2016-09-02T20:30:00Z 6.0
13 2016-09-02T20:45:00Z 0.0
14 2016-09-02T21:00:00Z 7.0
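For timestamps that are not on a fixed grid, a pandas alternative is to index each sensor's data by its own timestamps and let pd.concat align them. This is a sketch under the assumption that the arrays look like the ones in the question (row 0 timestamps, row 1 data); the column names sensor_a and sensor_b are made up for illustration.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# One Series per sensor, indexed by that sensor's timestamps.
s_a = pd.Series(a[1], index=a[0], name="sensor_a")
s_b = pd.Series(b[1], index=b[0], name="sensor_b")

# concat along axis=1 aligns on the union of the indexes (outer join);
# timestamps missing from a sensor become NaN.
merged = pd.concat([s_a, s_b], axis=1).sort_index()

# Optionally replace the gaps with 0 instead of NaN.
filled = merged.fillna(0)
print(filled)
```

Unlike the merge on a pre-built grid, this does not require the sampling intervals to be regular, since the index is simply the union of all observed timestamps.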