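Binning data into a 2D array on the fly

Time: 2017-08-03 06:44:42

Tags: python arrays numpy binning

I am computing two distances and binning them into a 2D array at intervals of 0.1. This is how I am currently doing it. However, it takes a lot of time for a large number of points.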

import numpy as np
from scipy.spatial import distance as d
dat=np.random.rand(100,3)
dd2d=np.zeros((10,10))
while len(dat)>0:
    i=len(dat)-1
    while i>0:
        dist0=d.euclidean(dat[0],dat[i])
        dist1=d.cosine(dat[0],dat[i])
        ind0=int(dist0/0.1)
        ind1=int(dist1/0.1)
        if ind0>9 or ind1>9:
            pass
        else:
            dd2d[ind0,ind1]+=1
        i-=1
    dat=np.delete(dat,0,axis=0)
    print(len(dat))

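What is the most efficient way to do this?

Also, how can I convert the while loops in the code into for loops, so that I can add a progressbar / tqdm to track the run time?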

1 Answer:

Answer 0: (score: 2)

If you are already importing scipy.spatial.distance, you might as well use pdist. After that you are just making a 2D histogram, so use np.histogram2d.

def binDists2d(dat, f1 = 'euclidean', f2 = 'cosine'):
    dist0 = d.pdist(dat, f1)
    dist1 = d.pdist(dat, f2)
    rng = np.array([[0, 1], [0, 1]])
    return np.histogram2d(dist0, dist1, bins = 10, range = rng)
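
As a small illustration of how np.histogram2d treats its range argument (the input values below are made up for the example): pairs where either value falls outside the range are simply not counted, which plays the role of the `if ind0>9 or ind1>9: pass` branch in the question. The function returns a tuple (counts, xedges, yedges), so the 10x10 array is its first element.

```python
import numpy as np

# Made-up distances for three pairs; the third pair has a y value outside [0, 1].
dist0 = np.array([0.05, 0.15, 0.95])
dist1 = np.array([0.05, 0.25, 1.50])

counts, xedges, yedges = np.histogram2d(
    dist0, dist1, bins=10, range=[[0, 1], [0, 1]]
)

print(counts.shape)  # (10, 10)
print(counts[0, 0])  # count for the pair (0.05, 0.05)
print(counts[1, 2])  # count for the pair (0.15, 0.25)
print(counts.sum())  # the out-of-range pair is not counted
```

One small difference from the loop in the question: a value exactly equal to the upper limit 1.0 is counted in the last bin by np.histogram2d, whereas the loop would skip it.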

pdist only returns the upper triangular elements. If you want to do this manually, use np.triu_indices, which you could use to generate the distances if scipy is unavailable.

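def cosdist(u, v):
    # Row-wise cosine distance; u and v are (n_pairs, n_dims) arrays.
    dots = np.sum(u * v, axis = -1)
    return 1 - dots / (np.linalg.norm(u, axis = -1) * np.linalg.norm(v, axis = -1))

def binDists2d(dat, f0 = lambda u, v: np.linalg.norm(u - v, axis = -1), f1 = cosdist):
    # Indices (i, j) of every pair with i < j, i.e. the strict upper triangle.
    i, j = np.triu_indices(dat.shape[0], 1)
    dist0 = f0(dat[i], dat[j])
    dist1 = f1(dat[i], dat[j])
    rng = np.array([[0, 1], [0, 1]])
    return np.histogram2d(dist0, dist1, bins = 10, range = rng)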
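
To make the indexing concrete, here is a small sketch of what np.triu_indices returns and how fancy indexing with it yields one row per pair. The 4-point random array is just an assumption for illustration. Because dat[i] and dat[j] are 2D arrays, any distance function applied to them has to reduce along the last axis.

```python
import numpy as np

# Index pairs (i, j) with i < j for 4 points.
i, j = np.triu_indices(4, 1)
print(i)  # [0 0 0 1 1 2]
print(j)  # [1 2 3 2 3 3]

# Illustrative data: 4 random points in 3 dimensions.
rng = np.random.default_rng(0)
dat = rng.random((4, 3))

# One row per pair, so the reductions run along the last axis.
u, v = dat[i], dat[j]
euclid = np.linalg.norm(u - v, axis=-1)
cosine = 1 - np.sum(u * v, axis=-1) / (
    np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
)

# Compare with an explicit double loop over the same pairs, in the same order.
loop_euclid = [
    np.linalg.norm(dat[a] - dat[b]) for a in range(4) for b in range(a + 1, 4)
]
print(np.allclose(euclid, loop_euclid))  # True
```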

EDIT: Less memory-hungry version:

def binDists2d(dat, f0, f1, n = 1, bins = 10, rng = np.array([[0, 1], [0, 1]])):
    i_, j_ = np.triu_indices(dat.shape[0], 1)
    out = np.zeros((bins, bins))
    # Split the pair indices into n chunks so only one chunk of distances is held at a time.
    i_, j_ = np.array_split(i_, n), np.array_split(j_, n)
    for k, (i, j) in enumerate(zip(i_, j_)):
        dist0 = f0(dat[i], dat[j])
        dist1 = f1(dat[i], dat[j])
        # histogram2d returns (counts, xedges, yedges); accumulate only the counts.
        out += np.histogram2d(dist0, dist1, bins = bins, range = rng)[0]
        print(str(k + 1) + " of " + str(n) + " completed")
    return out
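
On the second part of the question (turning the loops into a for loop with a progress bar), one possible sketch is to iterate over chunks of index pairs and wrap that iterable in tqdm. The names bin_dists_progress, euclid and n_chunks are hypothetical, and the code falls back to a plain loop if tqdm is not installed.

```python
import numpy as np

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:
    def tqdm(iterable, **kwargs):  # fallback: iterate without a progress bar
        return iterable

def euclid(u, v):
    # Row-wise Euclidean distance.
    return np.linalg.norm(u - v, axis=-1)

def cosdist(u, v):
    # Row-wise cosine distance.
    return 1 - np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    )

def bin_dists_progress(dat, f0=euclid, f1=cosdist, n_chunks=10, bins=10):
    # Hypothetical helper: chunked 2D binning with a progress bar over the chunks.
    i_all, j_all = np.triu_indices(dat.shape[0], 1)
    out = np.zeros((bins, bins))
    chunks = list(zip(np.array_split(i_all, n_chunks),
                      np.array_split(j_all, n_chunks)))
    for i, j in tqdm(chunks, desc="binning"):
        counts, _, _ = np.histogram2d(
            f0(dat[i], dat[j]), f1(dat[i], dat[j]),
            bins=bins, range=[[0, 1], [0, 1]]
        )
        out += counts
    return out

# Same shape of input as in the question: 100 random points in 3 dimensions.
dat = np.random.default_rng(0).random((100, 3))
hist = bin_dists_progress(dat)
print(hist.sum())  # number of pairs whose two distances both fall in [0, 1]
```

The progress bar advances once per chunk, so a larger n_chunks gives finer-grained feedback and lower peak memory at the cost of more Python-level iterations.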