Optimizing a loop of sklearn Euclidean distance computations

Date: 2016-07-04 09:25:14

Tags: python numpy scipy scikit-learn

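I'm looking for a smart way to optimize this loop of Euclidean distance calculations. For each vector, the calculation finds its mean distance to all other vectors.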

Because my vector array is far too large to compute eucl_dist = euclidean_distances(eigen_vs_cleaned) in one call, I'm running the loop row by row.

Currently a typical eigen_vs_cleaned has a shape of at least (300000, 1000), and I need to handle more (e.g. (2000000, 10000)).
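For a sense of scale, here is a rough back-of-the-envelope estimate of how large the full pairwise distance matrix would be. It assumes 8-byte float64 entries, which is an assumption and not something stated above; the helper name full_matrix_gb is hypothetical.

```python
def full_matrix_gb(n_rows, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense n_rows x n_rows distance matrix."""
    return n_rows * n_rows * bytes_per_value / 1e9


print(full_matrix_gb(300_000))    # 720.0
print(full_matrix_gb(2_000_000))  # 32000.0
```

Under that assumption the full matrix would need about 720 GB for 300000 rows, which is why it cannot be built in a single call on an ordinary machine.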

Is there a smarter way?

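import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Mean distance from each row to every row (the zero self-distance is included)
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)

for z in range(eigen_vs_cleaned.shape[0]):
    if z % 10000 == 0:
        print(z)  # progress indicator
    # Distances from row z to all rows; the result has shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
    eucl_dist_meaned[z] = eucl_dist_temp.mean()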

1 Answer:

Answer 0 (score: 0):

I'm not a python / numpy guru, but this is my first step at optimizing it. At least on my MacPro it runs faster.

from joblib import Parallel, delayed
import multiprocessing
import os
import tempfile
import shutil

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Create a temporary directory and define the array path
path = tempfile.mkdtemp()
out_path = os.path.join(path,'out.mmap')
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')

eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0],dtype=float)

num_cores = multiprocessing.cpu_count()

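def runparallel(row, out):
    if row%10000==0:
        print(row)  # progress indicator
    # Distances from this row to every row; the result has shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
    # Store the scalar mean in the shared memory-mapped output array
    out[row] = eucl_dist_temp.mean()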

nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))

Then I save the output:

eucl_dist_meaned = np.array(out,copy=True,dtype=float)
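A possible further step, which is not part of the code above, is to process blocks of rows instead of single rows, so that each call does one large matrix product. The sketch below uses only NumPy and the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. The function name mean_euclidean_distances and the chunk_size parameter are hypothetical, and the sketch assumes the input array itself fits in memory.

```python
import numpy as np


def mean_euclidean_distances(X, chunk_size=256):
    """Mean distance from each row of X to all rows of X (self included)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq_norms = np.einsum("ij,ij->i", X, X)  # squared norm of every row
    means = np.empty(n, dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = X[start:stop]
        # Squared distances between the rows of this block and all rows
        sq_dists = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * (block @ X.T)
        # Rounding can leave tiny negative values; clip them before the sqrt
        np.maximum(sq_dists, 0.0, out=sq_dists)
        means[start:stop] = np.sqrt(sq_dists).mean(axis=1)
    return means
```

It would be called as eucl_dist_meaned = mean_euclidean_distances(eigen_vs_cleaned). Each block needs roughly chunk_size * n_rows * 8 bytes of temporary memory, so chunk_size has to be tuned to the available RAM. Because of rounding, the self-distance may come out as a very small positive number rather than exactly zero.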