Question

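Can anyone tell me how to use pandas more efficiently? At the moment I am doing the following to work out the correlation of two items, but it is not very fast.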

totalTop = 0.0
totalBottom = 0.0
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        if i<j:
            ##  get the weights
            wgt_i = dataWgt_df.loc[df.index[i]][0]
            wgt_j = dataWgt_df.loc[df.index[j]][0]
            ##  get the std's
            std_i = dataSTD_df.loc[date][df.index[i]][0]
            std_j = dataSTD_df.loc[date][df.index[j]][0]
            ##  get the correlation
            #print(corr.loc[df.index[i]][df.index[j]])
            cor = corr.loc[df.index[i]][df.index[j]]
            ##  create running total
            totalBottom = totalBottom + (wgt_i * wgt_j * std_i * std_j)
            totalTop = totalTop + (wgt_i * wgt_j * std_i * std_j * cor)

What I would like to do is create a single matrix like this, with ones strictly above the diagonal:

0  1  1  1  1
0  0  1  1  1
0  0  0  1  1
0  0  0  0  1
0  0  0  0  0

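I could then use it to step through the various data frames (wgt_i, wgt_j, std_i, std_j). That would give me a top and a bottom data frame, which I could then total with the sum function to get the result.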

My main problem is how to quickly create the identity data frame and then the wgt_i etc. data frames; the rest is relatively simple.

Answer 1

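I'm not a pandas expert, but it appears to work with numpy. Going on that assumption, there are a few things you can do with numpy to avoid the doubly nested loop.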

Ian is right; that is not an identity matrix. If you do want an identity matrix, you can simply use numpy.identity:

import numpy
numpy.identity(5)

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

However, if you want the exact matrix you specified above, you can use numpy.eye:


import numpy
n = 5  # yields a 5x5 array; adjust to whatever size you want
sum(numpy.eye(n, k=i) for i in range(1, n))  # built-in sum; numpy.sum on a generator is deprecated

array([[ 0.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.]])
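
For comparison, the same strictly upper triangular matrix can probably be built in a single call with numpy.triu. This is only a minimal sketch, not part of the original answer, and the helper name strict_upper_mask is made up for illustration.

```python
import numpy as np

def strict_upper_mask(n):
    """Return an n x n float array with ones strictly above the diagonal."""
    # k=1 keeps only the entries above the main diagonal;
    # the diagonal itself and everything below it become 0.
    return np.triu(np.ones((n, n)), k=1)

mask = strict_upper_mask(5)
print(mask)
```

The k argument plays the same role as in numpy.eye: k=0 would keep the main diagonal, k=1 starts one step above it.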

Answer 2

This isn't as short as @larsbutler's solution, but it is much faster for large n:

import numpy as np

n = 5
M = np.zeros((n,n))
M[np.triu_indices_from(M)] = 1
M[np.diag_indices_from(M)] = 0

which gives:

array([[ 0.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.]])
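
Once such a mask exists, the double loop from the question can likely be replaced by an outer product. The following is a sketch under assumptions, not a drop-in solution: the weights and standard deviations are taken to be 1-D sequences already aligned with df.index, the correlation matrix is the matching square array, and the names w, s, c and weighted_totals are hypothetical stand-ins for the asker's data frames.

```python
import numpy as np

def weighted_totals(w, s, c):
    """Sum w_i*w_j*s_i*s_j (bottom) and w_i*w_j*s_i*s_j*c_ij (top) over i < j."""
    w = np.asarray(w, dtype=float)
    s = np.asarray(s, dtype=float)
    c = np.asarray(c, dtype=float)
    ws = w * s
    pair = np.outer(ws, ws)                  # pair[i, j] = w_i*s_i * w_j*s_j
    mask = np.zeros_like(pair)
    mask[np.triu_indices_from(mask, k=1)] = 1  # ones where i < j, as in the matrix above
    total_bottom = (pair * mask).sum()
    total_top = (pair * c * mask).sum()
    return total_top, total_bottom

# Made-up sample data, for illustration only
w = [0.5, 0.3, 0.2]
s = [0.1, 0.2, 0.4]
c = [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.3],
     [0.5, 0.3, 1.0]]
top, bottom = weighted_totals(w, s, c)
print(top, bottom)
```

If the inputs live in pandas objects, they would first need to be pulled out as aligned arrays (for example with .to_numpy()); how to do that depends on how the frames are laid out, which the question does not show.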

Populating a Python array without a double loop
