Shuffling the rows of a large pandas DataFrame and correlating with a Series

Date: 2017-12-14 13:02:44

Tags: performance pandas numpy permutation correlation

I need to independently shuffle each row of a large pandas DataFrame (typical shape (10000, 1000)) many times, and then estimate the correlation of each row with a given Series.

The most efficient (i.e. fastest) way I have found while staying within pandas is the following:

for i in range(N):  # the larger N is, the better the estimate
    # df is my large DataFrame, with 10K rows and 1K columns
    df_sh = df.apply(numpy.random.permutation, axis=1)

    # s is the provided Series (shape of s = (1000,))
    corr = df_sh.corrwith(s, axis=1)

The two tasks take roughly the same time (i.e. about 30 seconds each). I tried converting my DataFrame to a numpy.array, running a for loop over the array and, for each row, first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, that only sped my two tasks up by a factor of 2. Are there other viable options to speed up the tasks? (Note: I have already parallelized the code with Joblib up to the maximum factor allowed by the machine I am using.)

1 Answer:

Answer 0 (score: 2)

Correlation between a 2D matrix/array and a 1D array/vector:

We can adapt corr2_coeff_rowwise to get the correlation between a 2D array/matrix and a 1D array/vector, like so -

import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1,keepdims=1)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
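A quick sanity check of this function, comparing its output row by row against plain np.corrcoef (the function is repeated here so the snippet is self-contained; the shapes and the fixed seed are illustrative only):

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Rowwise mean-centering of A, mean-centering of B
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()

    # Sum of squares across rows of A, and for B
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

rng = np.random.RandomState(0)
A = rng.rand(5, 100)   # 5 rows, 100 columns
B = rng.rand(100)      # the 1D vector

out = corr2_coeff_2d_1d(A, B)

# Each entry should equal the Pearson correlation of the corresponding row with B
expected = np.array([np.corrcoef(A[i], B)[0, 1] for i in range(5)])
assert np.allclose(out, expected)
```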

To shuffle every row, note that np.random.shuffle works along the first axis only and shuffles in place - so you would have to feed it a transposed view, it would apply a single permutation rather than an independent one per row, and you would need a copy if the original dataframe is needed elsewhere. Instead, we can draw a random matrix and argsort it row-wise (the `rand+argsort` trick) to get an independent permutation per row, and apply it with NumPy's advanced indexing.

So, let's use this to solve our case -

# Extract underlying array data for faster NumPy processing in the loop later on
a = df.values  
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
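To see why the `rand+argsort` trick is a valid shuffle, here is a small self-contained check (small illustrative shapes): argsorting an i.i.d. random matrix along axis 1 yields an independent permutation of the column indices for every row, so each shuffled row contains exactly the same values as the original row, just reordered.

```python
import numpy as np

rng = np.random.RandomState(42)
a = rng.rand(4, 6)

# rand + argsort: each row of `idx` is an independent random permutation of 0..5
idx = rng.rand(*a.shape).argsort(1)

# Row indices for advanced indexing, broadcast against `idx`
r = np.arange(a.shape[0])[:, None]
shuffled_a = a[r, idx]

# Every shuffled row is a permutation of the original row:
# sorting both must give identical results
assert np.allclose(np.sort(shuffled_a, axis=1), np.sort(a, axis=1))
```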

Optimized version #1

Now, we can precompute the parts involving the series, which stays the same across iterations. Hence, a further optimized version would look like this -

a = df.values  
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1,keepdims=1)

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
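A self-contained check of the precomputation logic (small illustrative shapes, fixed seed): the row means are invariant under a per-row permutation, so `A_mean` computed once from the original array stays correct for every shuffled copy, and the loop body still reproduces the plain Pearson correlation.

```python
import numpy as np

rng = np.random.RandomState(1)
a = rng.rand(8, 20)
s_ar = rng.rand(20)

# Precomputed series terms (constant across iterations)
B_mB = s_ar - s_ar.mean()
ssB = B_mB.dot(B_mB)

# Row means are permutation-invariant, so they are precomputed
# from the original array and reused for every shuffle
A_mean = a.mean(1, keepdims=True)

r = np.arange(a.shape[0])[:, None]
idx = rng.rand(*a.shape).argsort(1)
shuffled_a = a[r, idx]

A_mA = shuffled_a - A_mean
ssA = np.einsum('ij,ij->i', A_mA, A_mA)
corr = A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

# Cross-check against np.corrcoef row by row
expected = np.array([np.corrcoef(shuffled_a[i], s_ar)[0, 1]
                     for i in range(a.shape[0])])
assert np.allclose(corr, expected)
```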

Benchmarking

Setting up the inputs with the actual use-case shapes/sizes -

In [302]: df = pd.DataFrame(np.random.rand(10000,1000))

In [303]: s = pd.Series(df.iloc[0])

1. Original approach

In [304]: %%timeit
     ...: df_sh = df.apply(np.random.permutation, axis=1)
     ...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop

2. Proposed approach

Pre-processing part (done only once before starting the loop, hence not included in the timings) -

In [305]: a = df.values  
     ...: s_ar = s.values
     ...: r = np.arange(a.shape[0])[:,None]
     ...: 
     ...: B = s_ar
     ...: B_mB = B - B.mean()
     ...: ssB = B_mB.dot(B_mB)
     ...: 
     ...: A = a
     ...: A_mean = A.mean(1,keepdims=1)

Part of the proposed solution that runs inside the loop -

In [306]: %%timeit
     ...: idx = np.random.rand(*a.shape).argsort(1)
     ...: shuffled_a = a[r, idx]
     ...: 
     ...: A = shuffled_a
     ...: A_mA = A - A_mean
     ...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
     ...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop

So, we are seeing a 3x speedup here!