Filling gaps in a numpy array

Time: 2011-04-05 11:46:50

Tags: python numpy matplotlib scipy interpolation

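I just want to interpolate, in the simplest possible terms, a 3D dataset. Linear interpolation, nearest neighbour, any of these would suffice (this is to start off some algorithm, so no accurate estimate is required).

In newer scipy versions, things like griddata would be useful, but currently I only have scipy 0.8. So I have a "cube" array (data[:,:,:], of shape (Ni x Nj x Nk)) and an array of flags of the same size (flags[:,:,:], True or False). I want to interpolate my data for the elements of data where the corresponding element of flags is False, using e.g. the nearest valid data point in data, or some sort of linear combination of "close by" points.

There can be large gaps in the dataset in at least two dimensions. Other than coding a full-blown nearest-neighbour algorithm using kdtrees or similar, I can't really find a generic, N-dimensional nearest-neighbour interpolator.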

5 answers:

Answer 0 (score: 28)

Using scipy.ndimage, your problem can be solved with nearest-neighbour interpolation in two lines:

from scipy import ndimage as nd

ind = nd.distance_transform_edt(invalid_cell_mask, return_distances=False, return_indices=True)
data = data[tuple(ind)]

Now, in the form of a function:

import numpy as np
from scipy import ndimage as nd

def fill(data, invalid=None):
    """
    Replace the value of invalid 'data' cells (indicated by 'invalid') 
    by the value of the nearest valid data cell

    Input:
        data:    numpy array of any dimension
        invalid: a binary array of same shape as 'data'. 
                 data value are replaced where invalid is True
                 If None (default), use: invalid  = np.isnan(data)

    Output: 
        Return a filled array. 
    """    
    if invalid is None: invalid = np.isnan(data)

    ind = nd.distance_transform_edt(invalid, 
                                    return_distances=False, 
                                    return_indices=True)
    return data[tuple(ind)]
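
As a quick sanity check of the idea, here is a minimal sketch on a tiny 2D array. The helper name fill_nearest and the sample values are made up for illustration; the call to distance_transform_edt is the same as in the function above.

```python
import numpy as np
from scipy import ndimage as nd

def fill_nearest(data, invalid=None):
    # Illustrative copy of the fill() idea above (hypothetical name).
    if invalid is None:
        invalid = np.isnan(data)
    # For every cell, ind holds the index of the nearest valid (False) cell;
    # valid cells map to themselves.
    ind = nd.distance_transform_edt(invalid,
                                    return_distances=False,
                                    return_indices=True)
    return data[tuple(ind)]

# Two valid cells (1.0 and 8.0); everything else is missing.
sample = np.array([[1.0, np.nan, np.nan, np.nan],
                   [np.nan, np.nan, np.nan, 8.0]])
filled = fill_nearest(sample)
print(filled)
```

Each missing cell takes the value of whichever valid cell is closer in Euclidean distance, so here the two left columns become 1 and the two right columns become 8.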

Example of use:

def test_fill(s,d):
    # s is the size of one dimension, d is the number of dimensions
    data = np.arange(s**d).reshape((s,)*d)
    seed = np.zeros(data.shape,dtype=bool)
    seed.flat[np.random.randint(0,seed.size,int(data.size/20**d))] = True

    return fill(data,~seed), seed

import matplotlib.pyplot as plt
data,seed  = test_fill(500,2)
data[nd.binary_dilation(seed,iterations=2)] = 0   # draw (dilated) seeds in black
plt.imshow(np.mod(data,42))                       # show cluster

Result: [image not available: plot of the nearest-seed clusters produced by the code above]

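Answer 1 (score: 14)

You can set up a crystal-growth-style algorithm that shifts a view alternately along each axis, replacing only data that is flagged False but has a True neighbour. This gives a "nearest-neighbour"-like result (though not in Euclidean or Manhattan distance; I think it may be nearest neighbour if you count pixels, counting all connected pixels with common corners). This should be fairly efficient with NumPy, since it iterates only over the axes and the convergence iterations, not over small slices of the data.

Crude, fast and stable. I think that's what you were after: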

import numpy as np
# -- setup --
shape = (10,10,10)
dim = len(shape)
data = np.random.random(shape)
flag = np.zeros(shape, dtype=bool)
t_ct = int(data.size/5)
flag.flat[np.random.randint(0, flag.size, t_ct)] = True
# True flags the data
# -- end setup --

slcs = [slice(None)]*dim

while np.any(~flag): # as long as there are any False's in flag
    for i in range(dim): # do each axis
        # make slices to shift view one element along the axis
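        slcs1 = slcs[:]
        slcs2 = slcs[:]
        slcs1[i] = slice(0, -1)
        slcs2[i] = slice(1, None)
        # index with tuples: current NumPy rejects a list of slices as an index
        slcs1, slcs2 = tuple(slcs1), tuple(slcs2)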

        # replace from the right
        repmask = np.logical_and(~flag[slcs1], flag[slcs2])
        data[slcs1][repmask] = data[slcs2][repmask]
        flag[slcs1][repmask] = True

        # replace from the left
        repmask = np.logical_and(~flag[slcs2], flag[slcs1])
        data[slcs2][repmask] = data[slcs1][repmask]
        flag[slcs2][repmask] = True

For good measure, here's a visualization (2D) of the zones seeded by the data originally flagged True.

[image not available: 2D visualization of the seeded zones]
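
The same growth loop can be wrapped in a function, which makes it easier to try on small inputs. This is only a sketch under assumptions: the name grow_fill is hypothetical, it works on copies rather than in place, and it adds a guard for the case where nothing is flagged True (the loop above would never terminate then).

```python
import numpy as np

def grow_fill(data, flag):
    """Grow flagged (True) values into unflagged cells, one step per axis."""
    data = data.copy()
    flag = flag.copy()
    if not flag.any():
        raise ValueError("at least one element of flag must be True")
    while not flag.all():
        for axis in range(data.ndim):
            # Two views shifted by one element along this axis.
            lo = [slice(None)] * data.ndim
            hi = [slice(None)] * data.ndim
            lo[axis] = slice(0, -1)
            hi[axis] = slice(1, None)
            lo, hi = tuple(lo), tuple(hi)

            # Unflagged cell whose upper neighbour is flagged: copy downwards.
            mask = ~flag[lo] & flag[hi]
            data[lo][mask] = data[hi][mask]
            flag[lo][mask] = True

            # Unflagged cell whose lower neighbour is flagged: copy upwards.
            mask = ~flag[hi] & flag[lo]
            data[hi][mask] = data[lo][mask]
            flag[hi][mask] = True
    return data

# A single known value in a 4x4x4 cube should spread to every cell.
cube = np.zeros((4, 4, 4))
cube[1, 2, 3] = 7.0
known = np.zeros(cube.shape, dtype=bool)
known[1, 2, 3] = True
grown = grow_fill(cube, known)
print(np.unique(grown))
```

The assignments through data[lo][mask] work because basic slicing returns a view, so the boolean assignment writes into the underlying array.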

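Answer 2 (score: 2)

Some time ago I wrote this script for my PhD: https://github.com/Technariumas/Inpainting

Example: http://blog.technariumas.lt/post/117630308826/healing-holes-in-python-arrays

Slow, but it does the job. A Gaussian kernel is the best choice; just check the size/sigma values.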
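
The linked script is not reproduced here. As a rough idea of what Gaussian-kernel hole filling can look like, the sketch below uses normalized convolution with scipy.ndimage.gaussian_filter. It is an illustration under assumptions, not code from that repository: the function name gaussian_fill and the default sigma are made up, and holes are assumed to be marked with NaN.

```python
import numpy as np
from scipy import ndimage as nd

def gaussian_fill(data, sigma=2.0):
    """Fill NaN holes with a Gaussian-weighted mean of nearby valid values."""
    valid = ~np.isnan(data)
    # Smooth the data with holes set to zero, and smooth the validity mask
    # the same way; their ratio is a weighted mean over valid cells only.
    num = nd.gaussian_filter(np.where(valid, data, 0.0), sigma, mode="constant")
    den = nd.gaussian_filter(valid.astype(float), sigma, mode="constant")
    out = data.copy()
    # Only fill holes that have some valid data within kernel reach.
    fillable = ~valid & (den > 1e-12)
    out[fillable] = num[fillable] / den[fillable]
    return out

# A constant field with a small hole should be restored to the constant.
field = np.full((10, 10), 3.0)
field[4:6, 4:6] = np.nan
healed = gaussian_fill(field, sigma=2.0)
print(healed[4:6, 4:6])
```

Holes wider than the kernel stay NaN in this sketch, which is where the size/sigma values mentioned above matter.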

Answer 3 (score: 1)

You may try to tackle your problem like this:

# main ideas described in very high level pseudo code
choose suitable base kernel shape and type (gaussian?)
while true
    loop over your array (moving average manner)
        adapt your base kernel to current sparsity pattern
        set current value based on adapted kernel
    break if converged

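This can actually be implemented in a very straightforward manner (especially if performance isn't a top concern).

Obviously this is just a heuristic, and you'll need to experiment with your actual data to find a proper adaptation scheme. If you view the kernel adaptation as kernel reweighting, you may want to base it on how the values have been propagated. For example, the weights for the original supports are 1, and they decay according to the iteration at which a value emerged.

Determining when this process has actually converged may also be tricky. Depending on the application, it may be reasonable to eventually leave some "gap regions" unfilled.

Update: here is a very simple implementation along the lines described above *):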

# scipy.stats.nanmean and numpy.NaN no longer exist; use numpy's nanmean and nan
from numpy import any, asarray as asa, isnan, nan as NaN, nanmean, ones, seterr
from numpy.lib.stride_tricks import as_strided as ast

def _a2t(a):
    """Array to tuple."""
    return tuple(a.tolist())

def _view(D, shape, strides):
    """View of flattened neighbourhood of D."""
    V= ast(D, shape= shape, strides= strides)
    return V.reshape(V.shape[:len(D.shape)]+ (-1,))

def filler(A, n_shape, n_iter= 49):
    """Fill in NaNs from mean calculated from neighbour."""
    # boundary conditions
    D= NaN* ones(_a2t(asa(A.shape)+ asa(n_shape)- 1), dtype= A.dtype)
    slc= tuple([slice(n// 2, -(n// 2)) for n in n_shape])
    D[slc]= A

    # neighbourhood
    shape= _a2t(asa(D.shape)- asa(n_shape)+ 1)+ n_shape
    strides= D.strides* 2

    # iterate until no NaNs, but not more than n iterations
    old= seterr(invalid= 'ignore')
    for k in range(n_iter):
        M= isnan(D[slc])
        if not any(M): break
        D[slc][M]= nanmean(_view(D, shape, strides), -1)[M]
    seterr(**old)
    A[:]= D[slc]

A simple demonstration of filler(.) in action would be something like:

In []: x= ones((3, 6, 99))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])
In []: x= NaN* x
In []: x[1, 2, 3]= 1
In []: x.sum(-1)
Out[]:
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
In []: filler(x, (3, 3, 5))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])

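*) So here nanmean(.) is just used to demonstrate the idea of the adaptation process. Based on this demonstration, it should be quite straightforward to implement a more complex adaptation and decaying weighting scheme. Also note that no attention has been paid to actual execution performance, but it should still be good (with reasonable input shapes).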
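
On current NumPy/SciPy, the same neighbourhood-mean iteration can also be sketched without stride tricks, using scipy.ndimage.generic_filter. This is only an illustration: the names nanmean_fill and _window_nanmean are hypothetical, it returns a new array instead of modifying its input, and the Python callback makes it slow on large arrays.

```python
import numpy as np
from scipy import ndimage as nd

def _window_nanmean(window):
    # Mean of the non-NaN values in the window, or NaN if there are none.
    values = window[~np.isnan(window)]
    return values.mean() if values.size else np.nan

def nanmean_fill(A, size, n_iter=49):
    """Iteratively replace NaNs by the mean of their non-NaN neighbours."""
    out = np.array(A, dtype=float)  # work on a float copy
    for _ in range(n_iter):
        missing = np.isnan(out)
        if not missing.any():
            break
        # Pad with NaN so that cells outside the array never contribute.
        means = nd.generic_filter(out, _window_nanmean, size=size,
                                  mode="constant", cval=np.nan)
        out[missing] = means[missing]
    return out

# One known value spreads through an otherwise empty array.
x = np.full((3, 6, 9), np.nan)
x[1, 2, 3] = 1.0
y = nanmean_fill(x, (3, 3, 5))
print(y.sum(-1))
```

A decaying weighting scheme of the kind mentioned above could replace _window_nanmean, but that is left out here.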

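Answer 4 (score: 0)

Maybe what you are looking for is a machine learning algorithm, such as a neural network or a support vector machine.

You may check this page, which has some links to SVM packages for Python: http://web.media.mit.edu/~stefie10/technical/pythonml.html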