Question

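I have an array y containing values observed on given days of the month. The days of the month are in an array x.

I need to interpolate these values with a cubic spline so that I can get a value for every day of the month. To cover every day of the month, I created an array xd.

If I want to plot the original y and the interpolated y (i.e. yd), I need to align them on the same axis, namely xd, the axis that covers every day of the month.

Is there an efficient way to quickly create a new y array that, based on the new x axis, holds the original y elements at the correct positions, with all other elements filled with zeros or (preferably) NaN?

For example, my first y is only available on day 2, so in the new y array I need the first element to show 0/NaN. The second element would then show the original y = 11, the third element would show NaN, and so on.

I have written the code below, which does what I described above, but I don't know whether there is a better/faster way to achieve this. In many cases the arrays are much larger than in the example below, so an efficient algorithm would help. Thanks.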

import numpy as np
import scipy.interpolate as sp

x = [2, 5, 7, 11, 13, 16, 19, 23, 25, 30]
y = [11, 10, 12, 14, 16, 19, 17, 14, 18, 17]

xd = np.linspace(0, max(x), int(max(x))+1) # create the new x axis
ipo = sp.splrep(x, y, k=3) # cubic spline
yd = sp.splev(xd, ipo) # interpolated y values

newY = np.zeros((1, len(yd)), dtype=float) # preallocate for the filled y values

for i in x: 
    if(i in xd): 
        idx, = np.where(xd == i) # find where the original x value is in the new x axis
        idx2, = np.where(np.array(x) == i)
        newY[0, int(idx)] = y[int(idx2)] # replace the y value of the new vector with the y value from original set

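Edit:

Just to clarify: the reason I need a set of aligned arrays (all sharing the same axis) is that when I plot the two arrays (newY and yd), I also add subplots showing the absolute and relative differences, to see how good the fit is.

I know that in this case the spline always passes through all the points I provide as input, so the differences will be zero, but the plotting function below should work for any kind of comparison (i.e. any kind of interpolated values versus the actual inputs). The plotting function I use is: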

def plotInterpolatedVsReal(xaxis, yaxis1, yaxis2, xlab='Dates', mainTitle='', width=25, zero2nan=True):
    if(zero2nan):
        yaxis1[yaxis1 == 0] = np.nan
        yaxis2[yaxis2 == 0] = np.nan

    fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, figsize=(10, 10))
    ax1.plot(xaxis, yaxis1, label='Interpolated')
    ax1.plot(xaxis, yaxis2, 'ro', label='Input')
    ax1.set_ylabel('Prices')
    ax1.legend(loc=0)
    ax2.bar(xaxis, yaxis1 - yaxis2, width=width)
    ax2.axhline(y=0, linewidth=1, color='k')
    ax2.set_ylabel('Errors [diff]')
    ax3.bar(xaxis, 100*(yaxis1/yaxis2 - 1), width=width)
    ax3.axhline(y=0, linewidth=1, color='k')
    ax3.set_ylabel('Errors [%]')
    ax3.set_xlabel(xlab);
    plt.suptitle(mainTitle)

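Edit 2:

Adding performance measurements for the proposals so far. My loop (method A) is faster because it only loops over the x vector, whereas the other two methods loop over xd, which can be much larger. In my case x has 23 elements and xd has 3655.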

def A():
    for i in x: 
        if(i in xd): 
            idx, = np.where(xd == i) # find where the original x value is in the new x axis
            idx2, = np.where(np.array(x) == i)
            newY[int(idx)] = y[int(idx2)] # replace the y value of the new vector with the y value from original set 

def B():
    for i, date in enumerate(xd):
        if date in x:
            new_y[i] = date

def C(): 
    known_values = dict(zip(x, y))

    for i,u in enumerate(xd):
        if u in known_values:
            newY[i] = known_values[u]

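%timeit A()
219 µs ± 8.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit B()
8.87 ms ± 95.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit C()
408 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also tried passing the A() function to Numba for JIT compilation:

A_nb = numba.jit(A)

which gives:

%timeit A_nb()
226 µs ± 610 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)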
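
For anyone who wants to reproduce this kind of comparison outside IPython, here is a rough sketch using the standard timeit module. The data are synthetic (23 random known days on a 3655-day axis, matching the sizes mentioned above) and the function names are made up for illustration; this is not the exact code that produced the timings above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
xd = np.arange(3655)                                   # dense axis (assumed integer days)
x = np.sort(rng.choice(xd, size=23, replace=False))    # 23 known positions, synthetic
y = rng.normal(size=x.size)                            # synthetic observed values

def loop_over_x():
    """Method A style: loop over the short array and search xd for each value."""
    out = np.full(xd.shape, np.nan)
    for xi, yi in zip(x, y):
        idx, = np.where(xd == xi)
        if idx.size:
            out[idx[0]] = yi
    return out

def loop_over_xd():
    """Method C style: loop over the long array with a dictionary lookup."""
    known = dict(zip(x.tolist(), y.tolist()))
    out = np.full(xd.shape, np.nan)
    for i, u in enumerate(xd.tolist()):
        if u in known:
            out[i] = known[u]
    return out

for fn in (loop_over_x, loop_over_xd):
    # best of 3 repeats, 20 calls each, reported per call
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e6:.0f} us per call")
```

Absolute numbers will differ from machine to machine; the point is only that both functions fill the same array and can be timed on equal terms.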

Answer 1

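As I understand it, the goal of all this is to plot the y values on the same figure, so why not just do that directly? A Matplotlib axes can easily handle data sets with different x values on the same plot, like this: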

import numpy as np
import scipy.interpolate as sp
import matplotlib.pyplot as plt

x = [2, 5, 7, 11, 13, 16, 19, 23, 25, 30]
y = [11, 10, 12, 14, 16, 19, 17, 14, 18, 17]

xd = np.linspace(0, max(x), int(max(x)) + 1)  # create the new x axis
ipo = sp.splrep(x, y, k=3)  # cubic spline
yd = sp.splev(xd, ipo)  # interpolated y values

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, label='Original')
ax.plot(xd, yd, label='Interpolated')
plt.legend()
plt.grid()

plt.show()

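As desired, each set of y data is aligned with its own x axis, without any preprocessing. The only interpolation done here is the one Matplotlib performs for display purposes.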
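
The same idea may carry over to the error subplots described in the question's edit: the fitted curve can be evaluated at the original x positions, so the differences are computed without building an aligned array first. A minimal sketch, assuming the same x, y and spline as above (the names fitted, abs_err and rel_err are made up for illustration):

```python
import numpy as np
import scipy.interpolate as sp

x = [2, 5, 7, 11, 13, 16, 19, 23, 25, 30]
y = [11, 10, 12, 14, 16, 19, 17, 14, 18, 17]

ipo = sp.splrep(x, y, k=3)     # cubic spline, as in the question
fitted = sp.splev(x, ipo)      # evaluate the fit only where inputs exist

y_arr = np.asarray(y, dtype=float)
abs_err = fitted - y_arr               # absolute difference at each input point
rel_err = 100 * (fitted / y_arr - 1)   # relative difference in percent
```

These arrays have the same length as x, so they could be passed to something like ax2.bar(x, abs_err) and ax3.bar(x, rel_err) directly.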

Since you do need an array filled with NaN, here is one way to do it:

new_y = np.full(yd.shape, np.nan)  # NaN everywhere by default
for i, date in enumerate(xd):
    if date in x:
        new_y[i] = y[x.index(date)]  # store the observed y value, not the date

This could probably be reduced to one of those fancy one-liners.
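
One possible vectorized version of the NaN fill, as a sketch rather than a definitive answer. It assumes xd is sorted in ascending order and that every value in x appears exactly in xd; if that does not hold, np.searchsorted will still return positions, but they will point at the wrong elements. The name pos is made up for illustration.

```python
import numpy as np

x = np.array([2, 5, 7, 11, 13, 16, 19, 23, 25, 30])
y = np.array([11, 10, 12, 14, 16, 19, 17, 14, 18, 17], dtype=float)
xd = np.linspace(0, x.max(), int(x.max()) + 1)   # 0, 1, ..., 30

new_y = np.full(xd.shape, np.nan)   # NaN everywhere by default
pos = np.searchsorted(xd, x)        # index of each x value inside the sorted xd
new_y[pos] = y                      # drop the known values into place
```

When xd is simply the integers 0..max(x), pos is identical to x, so new_y[x] = y would do the same job.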

Answer 2

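Sorry if I have completely misunderstood your code, but isn't np.linspace(0, max(x), int(max(x))+1) just a roundabout way of writing np.array(range(1+max(x)))? It looks like you are simply taking 1+max(x) linearly spaced samples between 0 and max(x) (inclusive), which is the same as taking the integers from 0 to max(x).

In that case, is this really necessary?

if(i in xd): 
    idx, = np.where(xd == i) # find where the original x value is in the new x axis

If xd is really just the list of integers from 0 to max(x), then every element of x is in xd by definition, and idx should always be equal to i.

(Assuming, of course, that x contains only non-negative integers.)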

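In that case the values can be written directly by index:

xd = np.array(range(1+max(x)))
newY = np.zeros(len(xd))

for i,j in zip(x, y):
    newY[i] = j

Edit: In the more general case, where the new axis is not just the simple integer range 0..max(x), I suggest converting the known values into a dictionary and then iterating over the array. This should be more efficient, because the linear search is replaced by a dictionary lookup.

known_values = dict(zip(x, y))

# xd = [... your new axis ...]
newY = np.zeros(len(xd))

for i, u in enumerate(xd):  # u instead of x, so the list x is not shadowed
    if u in known_values:
        newY[i] = known_values[u]

Edit: Interestingly, the performance is much worse. Apparently this happens when the known values are few enough (iterating over the large array is then much more expensive), but I don't think that is really a problem.

There is yet another way to loop that exploits the fact that both arrays are sorted, but it replaces np.where with an explicit loop, and I am not sure whether it would actually be more efficient; that depends on how well optimized the native NumPy code is.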
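
The loop that exploits both arrays being sorted is mentioned above without code. A sketch of what it might look like, assuming x and xd are both in ascending order and x has no duplicates (the function name align_sorted is hypothetical, and NaN is used as the fill value as the question prefers):

```python
import numpy as np

def align_sorted(x, y, xd):
    """Place y on the axis xd in a single pass; x and xd must be sorted ascending."""
    new_y = np.full(len(xd), np.nan)
    j = 0                                   # current position in the known points
    for i, u in enumerate(xd):
        while j < len(x) and x[j] < u:      # skip known points that are not on xd
            j += 1
        if j == len(x):                     # all known points consumed
            break
        if x[j] == u:
            new_y[i] = y[j]
            j += 1
    return new_y

x = [2, 5, 7, 11, 13, 16, 19, 23, 25, 30]
y = [11, 10, 12, 14, 16, 19, 17, 14, 18, 17]
xd = np.arange(max(x) + 1)
new_y = align_sorted(x, y, xd)
```

Each array is walked once, so the work grows with len(x) + len(xd), but the loop runs in pure Python and may well be slower in practice than the NumPy-based versions.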

How to align two arrays of different lengths in Python (using NaN where there is no matching element)
