Question

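I'm dealing with some large data sets (observations as a function of time) that aren't continuous in time, i.e. there is a lot of missing data, where the complete record is absent. To make things interesting, there are many data sets, all with missing records, all at random places...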

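I somehow need to get the data "synchronised" in time, with the missing data flagged as missing rather than completely absent. I managed to get this partly working, but I still have some problems.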

Example:

import numpy as np

# The date range (in the format that I'm dealing with), which I define
# myself for the period in which I'm interested
dc = np.arange(2010010100, 2010010106)

# Observation dates (d1) and values (v1)
d1  = np.array([2010010100, 2010010104, 2010010105]) # date
v1  = np.array([10,         11,         12        ]) # values

# Another data set with (partially) other times
d2  = np.array([2010010100, 2010010102, 2010010104]) # date
v2  = np.array([13,         14,         15        ]) # values

# For now set -1 as fill_value
v1_filled = -1 * np.ones_like(dc)
v2_filled = -1 * np.ones_like(dc)

v1_filled[dc.searchsorted(d1)] = v1
v2_filled[dc.searchsorted(d2)] = v2

This gives me the result I want:

v1_filled = [10 -1 -1 -1 11 12]
v2_filled = [13 -1 14 -1 15 -1]

But only when the values in d1 or d2 are also in dc; if a value in d1 or d2 isn't in dc, the code fails, because searchsorted behaves like this:

If there is no suitable index, return either 0 or N (where N is the length of a).

For example, if I change d2 and v2 to:

d2  = np.array([2010010100, 2010010102, 2010010104, 0]) # date
v2  = np.array([13,         14,         15,         9999]) # values

The result is

[9999   -1   14   -1   15   -1]

In this case, since d2=0 isn't in dc, it should discard that value instead of inserting it at the start (or end). Any idea how to achieve that easily?

Answer 1

If you call d2 = np.intersect1d(dc, d2) before dc.searchsorted(d2), it will remove all the elements of d2 that are not in dc.
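
One caveat with the suggestion above: np.intersect1d returns only the filtered dates, so v2 would need the same filtering to stay paired with d2. A minimal sketch of one way to do that, assuming a boolean mask built with np.isin is acceptable (the mask and the name keep are illustrative additions, not part of the original answer):

```python
import numpy as np

# Data taken from the question
dc = np.arange(2010010100, 2010010106)
d2 = np.array([2010010100, 2010010102, 2010010104, 0])
v2 = np.array([13, 14, 15, 9999])

# Boolean mask: True where an observation date also occurs in dc.
# "keep" is an illustrative name, not from the original post.
keep = np.isin(d2, dc)

# Start with the fill value everywhere, then place only the values whose
# dates are in dc; the same mask filters both dates and values.
v2_filled = np.full(dc.shape, -1)
v2_filled[dc.searchsorted(d2[keep])] = v2[keep]

print(v2_filled)
```

Because the same mask is applied to d2 and v2, the stray date 0 and its value 9999 are dropped together, and nothing gets written to index 0 or N.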
