我从聪明的方式看,以优化这个循环的欧氏距离计算。该计算寻找与所有其他向量的平均距离。
因为我的矢量数组真的很大:eucl_dist = euclidean_distances(eigen_vs_cleaned) 我正在逐行运行循环。
目前典型的eigen_vs_cleaned形状至少(300000,1000),我必须更多。 (如2000000,10000)
更聪明的方法吗?
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0],dtype=float)
from sklearn.metrics.pairwise import euclidean_distances
for z in range(eigen_vs_cleaned.shape[0]):
if z%10000==0:
print(z)
eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
eucl_dist_meaned[z] = eucl_dist_temp.mean(axis=1)
答案 0 :(得分:0)
我不是python / numpy guru但是这是我优化它的第一步。它至少在我的MacPro上跑得更好。
from joblib import Parallel, delayed
import multiprocessing
import os
import tempfile
import shutil
from sklearn.metrics.pairwise import euclidean_distances
# Creat a temporary directory and define the array pat
path = tempfile.mkdtemp()
out_path = os.path.join(path,'out.mmap')
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0],dtype=float)
num_cores = multiprocessing.cpu_count()
def runparallel(row, out):
if row%10000==0:
print(row)
eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
out[row] = eucl_dist_temp.mean(axis=1)
##
nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))
然后我保存输出:
eucl_dist_meaned = np.array(out,copy=True,dtype=float)