Question

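I have code that computes pairwise distances and residuals for my data (X, Y, Z). The data is fairly large (about 7000 rows on average), so I care about efficiency. My initial code is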

import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()

data = pd.read_excel(file_path)
data = np.array(data, dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
npoints, cols = data.shape

pwdistance = np.zeros((npoints, npoints))
pwresidual = np.zeros((npoints, npoints))
for i in range(npoints):
    for j in range(npoints):
        pwdistance[i, j] = np.sqrt((data[i, 0] - data[j, 0])**2 + (data[i, 1] - data[j, 1])**2)
        pwresidual[i, j] = (data[i, 2] - data[j, 2])**2

For pwdistance, I changed it to the following, which works very well.

pwdistance = squareform(pdist(data[:,:2]))

Is there a Pythonic way to compute my pwresidual as well, so that I don't need the loops and my code runs faster?

Answer 1

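One approach is to extend the dimensions of the data[:,2] slice (the third column, Z) to form a 2D array, and subtract the 1D slice itself from it. The subtractions are then performed in a vectorized way following the rules of broadcasting.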

So, simply do -

pwresidual = (data[:,2,None] - data[:,2])**2

Step-by-step run -

In [132]: data[:,2,None].shape # Slice extended to a 2D array
Out[132]: (4, 1)

In [133]: data[:,2].shape # Slice as 1D array
Out[133]: (4,)

In [134]: data[:,2,None] - data[:,2] # Subtractions with broadcasting
Out[134]: 
array([[ 0.        ,  0.67791602,  0.13298141,  0.61579315],
       [-0.67791602,  0.        , -0.54493461, -0.06212288],
       [-0.13298141,  0.54493461,  0.        ,  0.48281174],
       [-0.61579315,  0.06212288, -0.48281174,  0.        ]])

In [137]: (data[:,2,None] - data[:,2]).shape # Verify output shape
Out[137]: (4, 4)

In [138]: (data[:,2,None] - data[:,2])**2 # Finally elementwise square
Out[138]: 
array([[ 0.        ,  0.45957013,  0.01768406,  0.3792012 ],
       [ 0.45957013,  0.        ,  0.29695373,  0.00385925],
       [ 0.01768406,  0.29695373,  0.        ,  0.23310717],
       [ 0.3792012 ,  0.00385925,  0.23310717,  0.        ]])
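
As a sanity check, the broadcasting expression can be compared against the double loop from the question. The following is only a minimal sketch: the random 5x3 array stands in for the real data (an assumption, not the asker's file), and np.subtract.outer is shown as an equivalent spelling of the same subtraction rather than as part of the original answer.

```python
import numpy as np

# Hypothetical stand-in for the real data: 5 rows, columns X, Y, Z
rng = np.random.default_rng(0)
data = rng.random((5, 3))

z = data[:, 2]  # the Z column as a 1D array of shape (n,)
n = len(z)

# Reference result: the double loop from the question
expected = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        expected[i, j] = (z[i] - z[j]) ** 2

# Broadcasting: shape (n, 1) minus shape (n,) yields shape (n, n)
pwresidual = (z[:, None] - z) ** 2

# Same subtraction written with the ufunc's outer method
pwresidual_outer = np.subtract.outer(z, z) ** 2

assert np.allclose(pwresidual, expected)
assert np.allclose(pwresidual_outer, expected)
```

Both vectorized forms build the full n-by-n array in one step, so for roughly 7000 rows the result alone takes about 7000 * 7000 * 8 bytes (around 390 MB) as float64.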
