How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?

Time: 2018-08-31 04:26:41

Tags: python arrays numpy scipy python-xarray

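I have a 1D array of independent variable values (x_array) that match the time-steps in a 3D numpy array of spatial data with multiple time-steps (y_array). My actual data is much larger: 300+ time-steps and up to 3000 * 3000 pixels: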

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

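I want to compute a per-pixel linear regression and obtain the R-squared, P-value, intercept and slope for each xy pixel in y_array, with the values in x_array as my independent variable.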

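I can reshape the data into a form that can be passed to np.polyfit, which is vectorised and fast:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

However, this ignores pixels that contain any NaN values (np.polyfit does not support NaN values), and it does not calculate the statistics I need (R-squared, P-value, intercept and slope).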

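The answer here uses scipy.stats.linregress to calculate the statistics I need, and suggests avoiding the NaN problem by masking out the NaN values. However, that example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case, where each column in y_array_reshaped has a different set of NaN values.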
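For reference, the 1D masking idea applied to a single pixel might look like the sketch below. It assumes the top-left pixel of the example y_array, and np.polyfit stands in for linregress only to keep the sketch free of extra dependencies; the mask is the point.

```python
import numpy as np

x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Time series of the top-left pixel of y_array (NaN at the third time-step)
y_pixel = np.array([-0.2, -0.2, np.nan, -0.1])

# Keep only the time-steps where this pixel actually has data
mask = ~np.isnan(y_pixel)

# First-degree fit on the valid samples only
slope, intercept = np.polyfit(x_array[mask], y_pixel[mask], 1)
print(slope, intercept)
```

The difficulty described above is that this mask differs from pixel to pixel, so it cannot be applied once to the whole reshaped array.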

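My question: how can I calculate regression statistics for every pixel in a large multi-dimensional array (300 x 3000 x 3000) containing many NaN values, in a reasonably fast, vectorised way?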

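Desired result: 3 x 3 arrays of regression statistic values (e.g. R-squared) for each pixel in y_array, even if that pixel contains NaN values at some point in the time series.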

4 answers:

Answer 0: (score: 3)

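The answer given at https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html is very good, because it relies mainly on the power of numpy vectorisation and broadcasting. However, it assumes that the data being analysed are complete, which is usually not the case in a real research cycle. One of the answers above aims to solve the missing-data problem, but I personally think more of the code needs updating, simply because np.mean() returns nan if the data contain nan. Fortunately, numpy provides nanmean(), nanstd() and so on, so we can compute the mean, standard deviation, etc. while ignoring the nans in the data.

Also, the program in the original blog targets data in netCDF format. Some people may not be familiar with that format and are more comfortable with a plain numpy.array. So here I provide a code example showing how to compute the covariance, correlation coefficient and so on between two 3-D arrays (the same logic applies to n-D arrays). Note that, for convenience, I set x_array to the indices of the first dimension of y_array, but in a real analysis x_array can certainly be read in from outside.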
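To make the np.mean() versus np.nanmean() point concrete, here is a small sketch on a hypothetical 3 time-steps x 2 pixels array (the values are made up for illustration):

```python
import numpy as np

# Hypothetical data: 3 time-steps (rows) x 2 pixels (columns)
y = np.array([[-0.2,   -0.3],
              [np.nan, -0.3],
              [-0.1,   -0.2]])

plain_mean = np.mean(y, axis=0)         # NaN for the first pixel
nan_mean = np.nanmean(y, axis=0)        # NaN is skipped
n_valid = np.sum(~np.isnan(y), axis=0)  # valid samples per pixel

print(plain_mean, nan_mean, n_valid)
```

The per-pixel count n_valid matters as much as the nan-aware mean, because every later statistic has to be normalised by the number of valid samples rather than by the number of time-steps.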

Code

import numpy as np

def linregress_3D(y_array):
    # y_array is a 3-D array with shape (time, lon, lat).
    # This function runs a linear regression on the time series of each (lon, lat) grid box, ignoring np.nan values.
    # Construct x_array holding the time indexes of y_array, namely the independent variable.
    x_array=np.empty(y_array.shape)
    for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
    x_array[np.isnan(y_array)]=np.nan
    # Compute the number of non-nan over each (lon,lat) grid box.
    n=np.sum(~np.isnan(x_array),axis=0)
    # Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
    x_mean=np.nanmean(x_array,axis=0)
    y_mean=np.nanmean(y_array,axis=0)
    x_std=np.nanstd(x_array,axis=0)
    y_std=np.nanstd(y_array,axis=0)
    # Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
    cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
    # Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
    cor=cov/(x_std*y_std)
    # Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
    slope=cov/(x_std**2)
    # Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
    intercept=y_mean-x_mean*slope
    # Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
    tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr=slope/tstats
    from scipy.stats import t
    # Two-sided p-value: use |t| so that negative slopes do not give p > 1.
    p_val=t.sf(np.abs(tstats),n-2)*2
    # Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
    # r_square also equals cor**2 in one-variable linear regression, which can be used as a check.
    r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
    rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
    # Apply further filtering if needed (e.g. require at least 3 data records for a regression) and return the values.
    n=n*1.0 # convert n from integer to float to enable later use of np.nan
    n[n<3]=np.nan
    slope[np.isnan(n)]=np.nan
    intercept[np.isnan(n)]=np.nan
    p_val[np.isnan(n)]=np.nan
    r_square[np.isnan(n)]=np.nan
    rmse[np.isnan(n)]=np.nan
    return n,slope,intercept,p_val,r_square,rmse

Sample output

I have used this program to test two 3-D arrays of 227x3601x6301 pixels, and it finished the job within 20 minutes, less than 10 minutes each.
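The answer notes that x_array can be read in from outside instead of being built from time indexes. A minimal sketch of that variant follows, using the question's example data. The function name nan_linregress_external_x is hypothetical, and only slope, intercept and R-squared are computed to keep it short.

```python
import numpy as np

def nan_linregress_external_x(x_1d, y_3d):
    """Per-pixel slope, intercept and R-squared, ignoring NaNs in y_3d.

    x_1d has shape (time,); y_3d has shape (time, rows, cols).
    Hypothetical helper, not part of the original answer.
    """
    # Broadcast x to the shape of y and blank it wherever y is missing,
    # so that both series use the same valid time-steps per pixel.
    x = np.broadcast_to(x_1d[:, None, None], y_3d.shape)
    x = np.where(np.isnan(y_3d), np.nan, x)
    n = np.sum(~np.isnan(y_3d), axis=0)
    x_mean = np.nanmean(x, axis=0)
    y_mean = np.nanmean(y_3d, axis=0)
    cov = np.nansum((x - x_mean) * (y_3d - y_mean), axis=0) / n
    x_var = np.nanvar(x, axis=0)
    y_var = np.nanvar(y_3d, axis=0)
    slope = cov / x_var
    intercept = y_mean - slope * x_mean
    r_square = cov**2 / (x_var * y_var)
    return slope, intercept, r_square

x_array = np.array([0.5, 0.2, 0.4, 0.4])
y_array = np.array([[[-0.2, -0.2, -0.3], [-0.3, -0.2, -0.3], [-0.3, -0.4, -0.4]],
                    [[-0.2, -0.2, -0.4], [-0.3, np.nan, -0.3], [-0.3, -0.3, -0.4]],
                    [[np.nan, np.nan, -0.3], [-0.2, -0.3, -0.7], [-0.3, -0.3, -0.3]],
                    [[-0.1, -0.3, np.nan], [-0.2, -0.3, np.nan], [-0.1, np.nan, np.nan]]])

slope, intercept, r_square = nan_linregress_external_x(x_array, y_array)
print(slope)
```

Masking x wherever y is missing is the key design choice: the means and variances of x are then taken over exactly the same time-steps as those of y, so each pixel gets the fit it would get from a separate masked 1D regression.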

Answer 1: (score: 1)

I'm not sure how this would scale (perhaps you could use dask), but here is a fairly simple way to do it with a pandas DataFrame and its apply method:

import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply our custom linregress wrapper to each column, then split the result tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)

Output:

                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154
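Since the question asks for 3 x 3 arrays, the flat pixel_number index has to be mapped back onto the grid. A small sketch, assuming the slope column is copied by hand from the output above and that the pixels were flattened in row-major order (which is what y_array.reshape(len(y_array), -1) does by default):

```python
import numpy as np

# Slope column copied from the output above, in pixel_number order
slopes_flat = np.array([0.071429, -0.071429, 0.357143,
                        0.105263, 1.000000, -0.285714,
                        0.105263, -0.285714, 0.071429])

grid_shape = (3, 3)  # y_array.shape[1:] in the example

# Undo the row-major flattening
slope_grid = slopes_flat.reshape(grid_shape)

# Row and column of a given pixel_number, e.g. pixel 5
row, col = np.unravel_index(5, grid_shape)
print(slope_grid)
print(row, col)
```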

Answer 2: (score: 1)

At the numpy level, you can use np.vectorize.

First, define the tricky part for a single set of data:

def compute(x,y):
        mask=~np.isnan(y)
        return linregress(x[mask],y[mask])

Then define the function that will process the whole data set:

comp = np.vectorize(compute,signature="(k),(k)->(),(),(),(),()")

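Then apply it, reorganising the data so that it follows the broadcasting rules (the time axis must come last):

res = comp(x_array,np.rollaxis(y_array,0,3))

Finally,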

final=np.dstack(res) 

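Now final[i,j] contains the five parameters returned by linregress for pixel (i,j).

This is roughly equivalent to the pandas answer, but about 2.5 times faster.
A 300 x (100x100 image) problem takes about 5 seconds, so budget hours for your full data set. I don't think it is easy to do much better, because the time is actually spent inside the linregress function.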
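To show how the signature and the axis reordering fit together, here is a self-contained sketch of the same pattern on the question's example data. It assumes np.polyfit in place of linregress (so only slope and intercept come back) purely to avoid the scipy dependency; fit_pixel and fit_all are illustrative names.

```python
import numpy as np

x_array = np.array([0.5, 0.2, 0.4, 0.4])
y_array = np.array([[[-0.2, -0.2, -0.3], [-0.3, -0.2, -0.3], [-0.3, -0.4, -0.4]],
                    [[-0.2, -0.2, -0.4], [-0.3, np.nan, -0.3], [-0.3, -0.3, -0.4]],
                    [[np.nan, np.nan, -0.3], [-0.2, -0.3, -0.7], [-0.3, -0.3, -0.3]],
                    [[-0.1, -0.3, np.nan], [-0.2, -0.3, np.nan], [-0.1, np.nan, np.nan]]])

def fit_pixel(x, y):
    # Drop the time-steps where this pixel is NaN, then fit a line
    mask = ~np.isnan(y)
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    return slope, intercept

# "(k),(k)->(),()": two length-k vectors in, two scalars out
fit_all = np.vectorize(fit_pixel, signature="(k),(k)->(),()")

# The core dimension k must be the last axis, so move time to the end:
# (time, rows, cols) -> (rows, cols, time)
y_time_last = np.moveaxis(y_array, 0, -1)

slope, intercept = fit_all(x_array, y_time_last)
print(slope.shape)
```

np.vectorize still calls fit_pixel once per pixel in a Python loop, which is consistent with the timing remark above: it tidies up the looping but does not remove it.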

Answer 3: (score: 1)

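The blog post mentioned in the comments above contains a very fast vectorised function for cross-correlation, covariance and regression of multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds, because it relies entirely on simple vectorised array operations in xarray.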

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

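I made one minor change (the first line after #3) to ensure that the function correctly accounts for the different number of NaN values in each pixel: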

import numpy as np
import xarray as xr
from scipy.stats import t

def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time,lat,lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.
    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.
    Lag values can be assigned to either of the data, with lagx shifting x, and
    lagy shifting y, with the specified lag amount.
    """
    # 1. Ensure that the data are properly aligned to each other.
    x, y = xr.align(x, y)

    # 2. Add lag information if any, and shift the data accordingly
    if lagx != 0:

        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (nan). Hence it needs to be dropped
        x = x.shift(time=-lagx).dropna(dim='time')

        # Next important step is to re-align the two datasets so that y adjusts
        # to the changed coordinates of x
        x, y = xr.align(x, y)

    if lagy != 0:
        y = y.shift(time=-lagy).dropna(dim='time')
        x, y = xr.align(x, y)

    # 3. Compute data length, mean and standard deviation along time axis:
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd = x.std(axis=0)
    ystd = y.std(axis=0)

    # 4. Compute covariance along time axis
    cov = np.sum((x - xmean)*(y - ymean), axis=0)/(n)

    # 5. Compute correlation along time axis
    cor = cov/(xstd*ystd)

    # 6. Compute regression slope and intercept:
    slope = cov/(xstd**2)
    intercept = ymean - xmean*slope

    # 7. Compute P-value and standard error
    # Compute t-statistics
    tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr = slope/tstats

    # Two-sided p-value: use |t| so that negative slopes do not give p > 1
    pval = t.sf(np.abs(tstats), n-2)*2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    return cov,cor,slope,intercept,pval,stderr
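For readers following steps 4 to 7, the quantities the function computes per pixel can be summarised as below. This is a sketch in my own notation: n is the number of valid time-steps, the bars and sigmas denote the population mean and standard deviation over time, and T with subscript n-2 is a Student t variable with n-2 degrees of freedom. The two-sided p-value is conventionally taken on the absolute value of t.

```latex
\begin{aligned}
\operatorname{cov}(x,y) &= \frac{1}{n}\sum_{t}(x_t-\bar{x})(y_t-\bar{y}) \\
r &= \frac{\operatorname{cov}(x,y)}{\sigma_x\,\sigma_y} \\
\text{slope} &= \frac{\operatorname{cov}(x,y)}{\sigma_x^{2}}, \qquad
\text{intercept} = \bar{y} - \text{slope}\cdot\bar{x} \\
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad
\text{stderr} = \frac{\text{slope}}{t} \\
p &= 2\,P\!\left(T_{n-2} > |t|\right)
\end{aligned}
```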