Question

My data looks like this:

timedelta64 1, temp1A, temp 1B, temp1C, ...
timedelta64 2, temp2A, temp 2B, temp2C, ...

The data is ingested into two numpy arrays:

An array of timestamps raw_timestamp, dtype=[('datetime', '<M8[s]')]

'2009-01-01T18:41:00', 
'2009-01-01T18:44:00',
'2009-01-01T18:46:00', 
'2009-01-01T18:47:00',

A table of sensor data raw_sensor, dtype=[ ('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]
```
 (755, 855, 755, 855, 743, 843, 743, 843, 2),
 (693, 793, 693, 793, 693, 793, 693, 793, 1),
 (755, 855, 755, 855, 743, 843, 743, 843, 2),
 (693, 793, 693, 793, 693, 793, 693, 793, 1),
```

I generate a new filled_timestamp with one row for every time step:

filled_timestamp = np.arange(np.datetime64(starttime), np.datetime64(endtime), np.timedelta64(interval))

Using idxs = np.in1d(filled_timestamp, raw_timestamp), I get all the indices of filled that match a timestamp in raw, so I can assign the corresponding data from raw_sensor to filled_sensor:

filled_sensor[idxs] = raw_sensor

Q1. Is there a better/faster way to intersect these?

Now the filled arrays look like:

>>> filled_timestamp, filled_sensor # shown side-by-side for convenience 
    array([ 
      1 #  ('2009-01-01T18:41:00')  (755, 855, 755, 855, 743, 843, 743, 843, 2),
      2 #  ('2009-01-01T18:42:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      3 #  ('2009-01-01T18:43:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      4 #  ('2009-01-01T18:44:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      5 #  ('2009-01-01T18:45:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      6 #  ('2009-01-01T18:46:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      7 #  ('2009-01-01T18:47:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1)
       ],
          dtype=[('datetime', '<M8[s]')], [('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]

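Q2. How do I fill the missing rows with the values of the nearest previous non-empty row, except for columns 0, 3, and the last one, which should be filled with 0?

In the example above:

Rows 2 and 3 would take their values from row 1,

Row 5 would take its values from row 4.

Final result: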

>>> filled_timestamp, filled_sensor # shown side-by-side for convenience 
    array([ 
      1 #  ('2009-01-01T18:41:00')  (755, 855, 755, 855, 743, 843, 743, 843, 2),
      2 #  ('2009-01-01T18:42:00')  (0, 855, 755, 0, 743, 843, 743, 843, 0),
      3 #  ('2009-01-01T18:43:00')  (0, 855, 755, 0, 743, 843, 743, 843, 0),
      4 #  ('2009-01-01T18:44:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      5 #  ('2009-01-01T18:45:00')  (0, 793, 693, 0, 693, 793, 693, 793, 0),
      6 #  ('2009-01-01T18:46:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      7 #  ('2009-01-01T18:47:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1)
       ],
          dtype=[('datetime', '<M8[s]')], [('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]

Answer 1

Intersection

Your best bet for a fast intersection is probably np.searchsorted. It does a binary search in filled_timestamp for the elements of raw_timestamp:

idx = np.searchsorted(filled_timestamp, raw_timestamp)

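This is only accurate if every element of raw_timestamp actually occurs in filled_timestamp, because np.searchsorted returns an insertion index regardless of whether the value is found.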
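One way to guard against that caveat is to look up the returned positions and compare them with the original values. The following is only a sketch: the second raw timestamp (18:44:30) is made up to show a non-matching value, and `safe_idx` and `is_exact` are illustrative names, not part of the question.

```python
import numpy as np

filled_timestamp = np.arange(np.datetime64('2009-01-01T18:41:00'),
                             np.datetime64('2009-01-01T18:48:00'),
                             np.timedelta64(60, 's'))
# Hypothetical input: the second value falls between two time steps
raw_timestamp = np.array(['2009-01-01T18:41:00', '2009-01-01T18:44:30'],
                         dtype='datetime64[s]')

idx = np.searchsorted(filled_timestamp, raw_timestamp)

# An insertion index can equal len(filled_timestamp), so clip before indexing
safe_idx = np.clip(idx, 0, len(filled_timestamp) - 1)

# True only where the insertion index points at an identical timestamp
is_exact = filled_timestamp[safe_idx] == raw_timestamp
```

If `is_exact.all()` is False, the rows of raw_timestamp that are flagged False have no exact counterpart in filled_timestamp and need separate handling.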

Non-vectorized solution

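You want to set the slice of filled_sensor from idx[n] to idx[n + 1] to the value of raw_sensor[n]:

from itertools import zip_longest
for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row

I am using zip_longest here so that the last value coming from idx[1:] is None, which makes the last slice equivalent to filled_sensor[idx[-1]:] without requiring a special condition.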
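The loop can be tried on its own with a cut-down version of the data. This is a sketch under assumptions: the dtype is shortened to two fields for readability, and idx is hard-coded to the values np.searchsorted would return for the timestamps in the question.

```python
from itertools import zip_longest

import numpy as np

# Shortened dtype for illustration; the question uses nine fields
dt = [('sensorA', '<u4'), ('signal', '<u4')]
raw_sensor = np.array([(755, 2), (693, 1), (755, 2), (693, 1)], dtype=dt)

idx = np.array([0, 3, 5, 6])           # positions of the raw rows in the filled array
filled_sensor = np.zeros(7, dtype=dt)  # one row per time step

# The final iteration pairs idx[-1] with None, i.e. filled_sensor[6:None] = row
for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row
```

Each raw row is copied forward until the next raw row starts, which is the forward fill asked for in Q2 (before zeroing any columns).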

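Vectorized solution

You can create filled_sensor directly from raw_sensor in one shot if you know which indices of raw_sensor to repeat. You can get that information by applying np.cumsum to idx converted to a boolean array:

idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True

Basically, we start with a boolean array of the same size as filled_timestamp that is True (1) wherever an entry from raw_timestamp matches. We can convert that into an index into raw_timestamp by counting how many matches have occurred up to that point:

indexes = np.cumsum(idx_mask) - 1

Keep in mind that indexes is an array of integers, not booleans. It increments whenever a new match is found. The - 1 converts from count to index, because the first match has a count of 1 instead of 0.

Now you can just make filled_sensor:

filled_sensor = raw_sensor[indexes]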

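The only possible caveat here is if filled_sensor[0] does not come from raw_sensor[0]. In that case indexes is -1 for the leading rows, so they are filled with raw_sensor[-1]. Given how you construct the times in filled based on raw, I am not sure that will ever be an issue.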
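If that case can occur, the affected rows are easy to detect because their index is negative. The sketch below uses a plain (non-structured) array and made-up values for brevity; zeroing the leading rows is just one possible policy, and `no_prior` is an illustrative name.

```python
import numpy as np

raw_sensor = np.array([10, 20, 30], dtype='<u4')  # simplified, made-up readings
idx = np.array([2, 3, 5])                         # first match is not at position 0

idx_mask = np.zeros(6, dtype=bool)
idx_mask[idx] = True
indexes = np.cumsum(idx_mask) - 1                 # leading entries are -1

filled_sensor = raw_sensor[indexes]               # leading rows wrongly hold raw_sensor[-1]

# Rows before the first raw reading have nothing to copy from
no_prior = indexes < 0
filled_sensor[no_prior] = 0
```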

Example

Here is an example of the intersection and the vectorized solution, using the data you show in the question.

We start with

raw_timestamp = np.array(['2009-01-01T18:41:00', '2009-01-01T18:44:00', '2009-01-01T18:46:00', '2009-01-01T18:47:00',], dtype='datetime64[s]')
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2), (693, 793, 693, 793, 693, 793, 693, 793, 1), (755, 855, 755, 855, 743, 843, 743, 843, 2), (693, 793, 693, 793, 693, 793, 693, 793, 1),], dtype=[('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')])

We can generate filled_timestamp as

filled_timestamp = np.arange('2009-01-01T18:41:00', '2009-01-01T18:48:00', 60, dtype='datetime64[s]')

which yields, as expected:

array(['2009-01-01T18:41:00', '2009-01-01T18:42:00', '2009-01-01T18:43:00', '2009-01-01T18:44:00', '2009-01-01T18:45:00', '2009-01-01T18:46:00', '2009-01-01T18:47:00'], dtype='datetime64[s]')

I have taken a slight liberty with the dtypes by making the timestamps plain arrays instead of structured arrays, but I think that should make no difference for your purposes.

idx = np.searchsorted(filled_timestamp, raw_timestamp) yields

idx = np.array([0, 3, 5, 6], dtype=int)

This means that indices 0, 3, 5, 6 in filled_timestamp match values from raw_timestamp.

idx_mask then becomes

idx_mask = np.array([True, False, False, True, False, True, True], dtype=bool)

This is basically synonymous with idx, except expanded to a boolean mask of the same size as filled_timestamp.

Now the tricky part, indexes = np.cumsum(idx_mask) - 1:

indexes = array([0, 0, 0, 1, 1, 2, 3], dtype=int)

This can be interpreted as follows: filled_sensor[0:3] should come from raw_sensor[0], filled_sensor[3:5] should come from raw_sensor[1], filled_sensor[5] should come from raw_sensor[2], and filled_sensor[6] should come from raw_sensor[3].

So now we use indexes to extract the correct elements of raw_sensor directly:

filled_sensor = raw_sensor[indexes]
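The steps above forward-fill every column. Q2 also asks for columns 0, 3, and the last one to be 0 on the filled-in rows. A possible sketch, assuming those columns correspond to the fields sensorA, sensorD, and signal, reuses idx_mask to find the rows that had no raw reading:

```python
import numpy as np

dt = [('sensor' + c, '<u4') for c in 'ABCDEFGH'] + [('signal', '<u4')]
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1),
                       (755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1)], dtype=dt)
idx = np.array([0, 3, 5, 6])  # as returned by np.searchsorted in the example

idx_mask = np.zeros(7, dtype=bool)
idx_mask[idx] = True
filled_sensor = raw_sensor[np.cumsum(idx_mask) - 1]

# Field access returns a view, so the masked assignment modifies filled_sensor.
# Only rows without a raw reading (~idx_mask) are zeroed.
for name in ('sensorA', 'sensorD', 'signal'):
    filled_sensor[name][~idx_mask] = 0
```

Rows that came straight from raw_sensor are left untouched; only the forward-filled rows have the three fields reset.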

numpy time series: merge and fill missing values with earlier values
