Question

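I have a large pandas DataFrame (97165 rows and 2 columns), and I would like to compute and store the correlation between these columns for every 100 rows. I want something like this:

First correlation -> rows 0 to 100 -> corr = 0.265

Second correlation -> rows 1 to 101 -> corr = 0.279

Third correlation -> rows 2 to 102 -> corr = 0.287

Every value has to be stored and later shown in a plot, so I need to keep all of them in a list or something similar.

I have been reading the pandas documentation on rolling windows (pandas rolling window), but I could not get anywhere with it. I tried writing a simple loop to get some results, but I ran into memory problems. The code I tried is: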

lcl = 100
a = []
for i in range(len(tabla)):

    x = tabla.iloc[i:lcl, [0]] 
    y = tabla.iloc[i:lcl, [1]]
    z = x['2015_Avion'].corr(y['2015_Hotel'])
    a.append(z) 
    lcl += 1

Any suggestions?

Answer 1

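We can optimize for both memory and performance by working with array data.

Approach #1

First off, let's have an array solution that gets the correlation coefficient between corresponding elements of two 1D arrays. This is essentially inspired by this post and looks like this -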

import numpy as np

def corrcoeff_1d(A,B):
    # Mean of each input array, subtracted from the array itself
    A_mA = A - A.mean(-1,keepdims=1)
    B_mB = B - B.mean(-1,keepdims=1)

    # Sum of squares
    ssA = np.einsum('i,i->',A_mA, A_mA)
    ssB = np.einsum('i,i->',B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('i,i->',A_mA,B_mB)/np.sqrt(ssA*ssB)
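As a quick sanity check, the function can be compared against `np.corrcoef`. This is a minimal sketch on synthetic data (the random arrays are an assumption for illustration, not the asker's data); the function is repeated in compact form so that the snippet runs on its own.

```python
import numpy as np

def corrcoeff_1d(A, B):
    # Same formula as above, written with the @ operator for brevity
    A_mA = A - A.mean()
    B_mB = B - B.mean()
    return (A_mA @ B_mB) / np.sqrt((A_mA @ A_mA) * (B_mB @ B_mB))

# Synthetic, partially correlated data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r_custom = corrcoeff_1d(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(np.isclose(r_custom, r_numpy))
```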

Now, to use it, run the same loop, but on the array data -

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
for i in range(N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])
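Note that the loop above runs over all N start positions, so the last lcl-1 windows are shorter than lcl rows, and the very last one contains a single row, for which the correlation is undefined. If only full windows are wanted, the loop can stop at N - lcl + 1. The following is a minimal sketch on a small synthetic `tabla` (the data and the names `n_full` and `out_full` are assumptions for illustration); it uses `np.corrcoef` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the asker's DataFrame (illustrative only)
rng = np.random.default_rng(1)
tabla = pd.DataFrame({
    "2015_Avion": rng.normal(size=500),
    "2015_Hotel": rng.normal(size=500),
})

lcl = 100
ar = tabla.to_numpy()
n_full = len(ar) - lcl + 1          # number of windows with exactly lcl rows
out_full = np.empty(n_full)
for i in range(n_full):
    # Window i covers rows i .. i+lcl-1
    out_full[i] = np.corrcoef(ar[i:i + lcl, 0], ar[i:i + lcl, 1])[0, 1]
```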

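We can further optimize performance by using convolution to pre-compute the rolling mean values that are used to compute A_mA in corrcoeff_1d, but first let's get the memory error out of the way.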
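The rolling-mean idea mentioned above could look roughly like this. It is only a sketch of the pre-computation step, not a full rolling-correlation implementation, and `rolling_mean` is a hypothetical helper name.

```python
import numpy as np

def rolling_mean(a, L):
    # Convolving with a length-L box kernel of weight 1/L averages each window;
    # mode="valid" keeps only positions where the kernel fully overlaps the data
    return np.convolve(a, np.ones(L) / L, mode="valid")

a = np.arange(10, dtype=float)
means = rolling_mean(a, 4)   # one mean per full window of 4 elements
```

Each entry `means[i]` is the mean of `a[i:i+4]`, so the means for all windows come from a single call instead of one `mean()` call per loop iteration.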

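Approach #2

Here is an almost vectorized approach: we vectorize most of the iterations, except for the leftover slices at the end that do not have the proper window length. The loop count is reduced from 97165 to lcl-1, i.e. a mere 99.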

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)

col0_win = strided_app(ar[:,0],lcl,S=1)
col1_win = strided_app(ar[:,1],lcl,S=1)
vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)
M = len(vectorized_out)
out[:M] = vectorized_out

for i in range(M,N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])

Helper functions -

# https://stackoverflow.com/a/40085052/ @ Divakar
def strided_app(a, L, S ):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

# https://stackoverflow.com/a/41703623/ @Divakar
def corr2_coeff_rowwise(A,B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1,keepdims=1)
    B_mB = B - B.mean(-1,keepdims=1)

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
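On NumPy 1.20 or newer, `sliding_window_view` can stand in for the hand-written `strided_app` helper. This is a swapped-in alternative rather than the code above: a minimal sketch that covers full windows only, with `rolling_corr` as a hypothetical name and synthetic data assumed for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_corr(a, b, L):
    # Read-only views of shape (len(a) - L + 1, L); no data is copied here
    A = sliding_window_view(a, L)
    B = sliding_window_view(b, L)

    # Subtracting the row means materializes full-size temporary arrays
    A_mA = A - A.mean(-1, keepdims=True)
    B_mB = B - B.mean(-1, keepdims=True)

    # Row-wise sums of squares and cross products
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)

# Synthetic data (illustrative only)
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = 0.3 * a + rng.normal(size=1000)
res = rolling_corr(a, b, 100)   # one value per full window
```

The windowed views themselves are cheap, but the mean-subtracted temporaries are not: for float64 data of the question's size (97165 rows, window 100), each is roughly 78 MB.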

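Correlation for NaN-filled data

Listed next are NumPy-based counterparts of the pandas correlation computation, for computing the correlation between two 1D arrays and row-wise correlation values.

1) Scalar correlation value between two 1D arrays -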

def nancorrcoeff_1d(A,B):
    # Get combined mask: True only where both A and B are non-NaN
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Mean of each input array over the valid pairs, subtracted from the array itself
    A_mA = A - np.nansum(A * comb_mask,-1,keepdims=1)/count
    B_mB = B - np.nansum(B * comb_mask,-1,keepdims=1)/count

    # Replace NaNs with zeros, so that later summations could be computed    
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    ssA = np.inner(A_mA,A_mA)
    ssB = np.inner(B_mB,B_mB)

    # Finally get corr coeff
    return np.inner(A_mA,B_mB)/np.sqrt(ssA*ssB)
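The behaviour being reproduced here is that pandas drops every position where either input is NaN before computing the correlation. The small check below illustrates that on made-up values (the arrays are an assumption for illustration): restricting both arrays to the pairwise-valid positions and calling `np.corrcoef` should agree with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Made-up arrays with NaNs at different positions (illustrative only)
A = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 7.0])
B = np.array([2.0, np.nan, 3.0, 5.0, 4.0, 9.0])

# A pair is valid only when both values are present
valid = ~(np.isnan(A) | np.isnan(B))

r_numpy = np.corrcoef(A[valid], B[valid])[0, 1]
r_pandas = pd.Series(A).corr(pd.Series(B))   # pandas excludes missing pairs
print(np.isclose(r_numpy, r_pandas))
```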

2) Row-wise correlation between two 2D arrays of shape (m,n), giving us a 1D array of shape (m,) -

def nancorrcoeff_rowwise(A,B):
    # Input : Two 2D arrays of same shapes (mxn). Output : One 1D array  (m,)
    # Get combined mask: True only where both A and B are non-NaN
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum(axis=-1,keepdims=1)

    # Rowwise mean of input arrays over the valid pairs & subtract from input arrays themselves
    A_mA = A - np.nansum(A * comb_mask,-1,keepdims=1)/count
    B_mB = B - np.nansum(B * comb_mask,-1,keepdims=1)/count

    # Replace NaNs with zeros, so that later summations could be computed    
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i',A_mA,B_mB)/np.sqrt(ssA*ssB)

Answer 2

You mention trying rolling. What exactly went wrong? This works for me:

my_res = tabla['2015_Avion'].rolling(100).corr(tabla['2015_Hotel'])

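my_res holds NaN for the first 99 positions, so my_res[99] should be the correlation between the row 0 to row 99 elements of the two columns, exactly what the pandas corr function would return if applied to just that subset. my_res[100] is then the correlation between the row 1 to row 100 elements.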
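The alignment described above can be checked directly. This is a minimal sketch on a synthetic `tabla` (the random data is an assumption for illustration; the column names follow the question).

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the asker's DataFrame (illustrative only)
rng = np.random.default_rng(3)
tabla = pd.DataFrame({
    "2015_Avion": rng.normal(size=300),
    "2015_Hotel": rng.normal(size=300),
})

my_res = tabla["2015_Avion"].rolling(100).corr(tabla["2015_Hotel"])

# Plain corr on the explicit row subsets, for comparison
first = tabla["2015_Avion"].iloc[0:100].corr(tabla["2015_Hotel"].iloc[0:100])
second = tabla["2015_Avion"].iloc[1:101].corr(tabla["2015_Hotel"].iloc[1:101])

print(np.isclose(my_res.iloc[99], first), np.isclose(my_res.iloc[100], second))

# Drop the leading NaNs to get the values to store or plot
values = my_res.dropna().to_numpy()
```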

Python: generating a rolling window to compute correlation
