A fast way to compute the differences between elements of a numpy array, subject to a condition

Time: 2017-08-01 05:24:46

Tags: python numpy

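Given a numpy array, I am trying to find the difference between every pair of elements, keeping only the differences that are less than 1000000 and greater than -1000000.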

pairWiseDiff = np.array([])
for ts in timestamps:
    diffTime = timestamps - ts
    individualTimeDiff = diffTime[np.logical_and(diffTime<1000000.0, diffTime>-1000000.0)]
    pairWiseDiff = np.append(pairWiseDiff, individualTimeDiff)

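This script works for input arrays shorter than 40000 elements. However, I want results for an array of length 150000, and for that the script above is very slow.

Any suggestions for speeding it up?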

3 answers:

Answer 0: (score: 0)

Assuming timestamps is 1-d. This exploits (a-b) = -(b-a) together with np.absolute:

def filterDiff(ts, d, f = lambda x: np.absolute(x)):
    i, j = np.triu_indices(ts.size, 1)
    diffs = ts[i] - ts[j]
    mask = f(diffs) < d  
    out = np.zeros((ts.size, ts.size))
    out[i[mask], j[mask]] = diffs[mask]
    out[j[mask], i[mask]] = -diffs[mask]
    return out
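A small usage sketch of the function above on made-up values, repeated here so that it runs on its own. Note that out[i, j] holds ts[i] - ts[j], which is the opposite sign to the timestamps - ts rows in the question.

```python
import numpy as np

def filterDiff(ts, d, f=lambda x: np.absolute(x)):
    i, j = np.triu_indices(ts.size, 1)      # index pairs with i < j
    diffs = ts[i] - ts[j]
    mask = f(diffs) < d
    out = np.zeros((ts.size, ts.size))
    out[i[mask], j[mask]] = diffs[mask]     # upper triangle
    out[j[mask], i[mask]] = -diffs[mask]    # lower triangle, using (a-b) = -(b-a)
    return out

ts = np.array([0.0, 1.0, 5.0])              # made-up sample values
out = filterDiff(ts, 3.0)
# Only the pair (0, 1) differs by less than 3, so the other off-diagonal entries stay 0
```

Because rejected pairs are left at 0, they cannot be told apart from pairs whose difference really is 0. That matters if duplicate timestamps are possible.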

Edit: for n-d arrays

def filterDiff(ts, d, f = lambda x: np.absolute(x)): # f must return scalar from ts.shape[1:]
    i, j = np.triu_indices(ts.shape[0], 1)
    diffs = ts[i] - ts[j]
    mask = f(diffs) < d  
    out = np.zeros((ts.shape[0], ts.shape[0],) + ts.shape[1:])
    out[i[mask], j[mask]] = diffs[mask]
    out[j[mask], i[mask]] = -diffs[mask]
    return out

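Edit 2: For memory problems, you will probably need to compute diffs and mask in chunks. mem is the approximate number of index pairs handled per chunk; set it as large as your machine can handle so that each chunk uses as much memory as is available.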

def filterDiff(ts, d, mem = 200000000, f = lambda x: np.absolute(x)):
    i, j = np.triu_indices(ts.shape[0], 1)
    t = i.size // mem + 1
    i, j = np.array_split(i, t), np.array_split(j, t) 
    diffs = []   # iterating and constructing over lists is easier than arrays
    for k, (i_, j_) in enumerate(zip(i, j)):
        diff_ = ts[i_] - ts[j_]
        mask = f(diff_) < d
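        diffs.append(diff_[mask])   # keep only the differences that pass the filter
        i[k], j[k] = i_[mask], j_[mask]   # filter this chunk's own index arrays
    diffs = np.hstack(diffs)   # chunks have different lengths, so join them into one 1-d array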
    i, j = np.hstack(i), np.hstack(j)
    return i, j, diffs  

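Depending on how sparse mask is, how you want to handle exceptions, and what computations follow, you may prefer to skip building out (as I did in the memory-saving version) and return i[mask], j[mask], diffs[mask] instead. Alternatively, out could be a scipy.sparse.coo_matrix or an np.ma.masked_array object.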
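A minimal sketch of the "skip out and return the pairs" variant described above. The helper name filter_diff_pairs and the sample values are made up for illustration; only the three returned arrays come from the answer's description.

```python
import numpy as np

def filter_diff_pairs(ts, d):
    """Hypothetical helper: return (i, j, diffs) for pairs i < j with |ts[i] - ts[j]| < d."""
    i, j = np.triu_indices(ts.size, 1)      # every index pair above the diagonal
    diffs = ts[i] - ts[j]
    mask = np.absolute(diffs) < d
    return i[mask], j[mask], diffs[mask]    # no (n, n) output array is allocated

ts = np.array([0.0, 1.0, 5.0, 6.5])         # made-up sample values
i, j, diffs = filter_diff_pairs(ts, 3.0)
```

The (diffs, (i, j)) triple is the layout that scipy.sparse.coo_matrix accepts as coo_matrix((data, (row, col)), shape=(n, n)), so a sparse matrix can be built from it later if one is needed.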

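This cuts the number of differences computed from n^2 to n*(n-1)/2 (roughly half, although the work is still quadratic in n) and gets rid of the explicit for loop.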
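The chunking idea can also be sketched by processing blocks of rows with broadcasting, which avoids building all index pairs up front. This is an alternative illustration, not the code from this answer: the function name pairwise_diff_blocked and the block parameter are assumptions. It returns the same values, in the same order, as the loop in the question.

```python
import numpy as np

def pairwise_diff_blocked(timestamps, limit=1000000.0, block=1024):
    """Hypothetical sketch: filtered pairwise differences, computed block rows at a time."""
    timestamps = np.asarray(timestamps, dtype=float)
    pieces = []
    for start in range(0, timestamps.size, block):
        rows = timestamps[start:start + block]
        # diff[r, c] = timestamps[c] - rows[r], i.e. `timestamps - ts` for each ts in the block
        diff = timestamps[None, :] - rows[:, None]
        # Boolean indexing flattens row by row, matching the order of the original loop
        pieces.append(diff[np.abs(diff) < limit])
    return np.concatenate(pieces) if pieces else np.array([])
```

Peak memory is about block * n floats for the temporary diff plus whatever passes the filter, so block can be tuned to the machine. The total work remains quadratic in n.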

Answer 1: (score: 0)

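You can cut the number of computations for pairWiseDiff to less than half. timestamps has N elements, so pairWiseDiff has N*N elements. Look at pairWiseDiff: the diagonal is zero, and the remaining elements mirror each other across the diagonal with opposite sign, so you do not have to compute them twice. Instead of N*N elements, you only need to compute (N*N - N)/2.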

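In my solution, pairWiseDiff still has N*N elements. You could improve on that with a triangular array in numpy, and the for loop could probably be eliminated with numpy.roll and/or slicing.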

import numpy as np
timestamps = np.array([0,1,2,3,4,5,6,7,8,9])+10
N = len(timestamps)
pairWiseDiff = np.zeros((N,N))

for n in range(1,N):
    pairWiseDiff[n,n:N] = timestamps[n:N] - timestamps[0:N-n]
    print(n,timestamps[n:N])

pairWiseDiff 
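A short sketch of the symmetry argument, assuming np.triu_indices is an acceptable way to enumerate the (N*N - N)/2 pairs above the diagonal. It is not the loop from this answer; the names rows, cols, upper, full and reference are illustrative.

```python
import numpy as np

timestamps = np.arange(10) + 10
N = len(timestamps)

rows, cols = np.triu_indices(N, 1)            # the (N*N - N)/2 pairs above the diagonal
upper = timestamps[cols] - timestamps[rows]   # computed once per pair

full = np.zeros((N, N))
full[rows, cols] = upper                      # upper triangle
full[cols, rows] = -upper                     # lower triangle follows by negation

# Reference: the full matrix from broadcasting, entry [n, m] = timestamps[m] - timestamps[n]
reference = timestamps[None, :] - timestamps[:, None]
```

Here full and reference hold the same values, while only half of the off-diagonal differences were actually subtracted.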

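Answer 2: (score: 0)

With only 1000 elements in timestamps, you get a speedup of more than 200x by avoiding np.append(pairWiseDiff, diffTime), and a little more by avoiding unnecessary computations. runA is your code, runB avoids np.append, and runC avoids unnecessary computations.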

run A:      1 loop,  best of 3: 1.64 s per loop
run B:    100 loops, best of 3: 7.5 ms per loop
run C:    100 loops, best of 3: 6.8 ms per loop

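The question is whether you really need the N*N array pairWiseDiff (reshaped to (N, N)) at all, because for evenly spaced timestamps, such as the np.arange test data used here, all the information it contains is already in: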

Dif  =  timestamps -timestamps[0]

Here I compare the 3 versions of the code:

timestamps  = np.arange(1000) + 10

def runA(timestamps):
    N = len(timestamps)
    pairWiseDiff = np.array([])
    for ts in timestamps:
        diffTime = timestamps - ts
        #individualTimeDiff = diffTime[np.logical_and(diffTime<1000000.0, diffTime>-1000000.0)]
        pairWiseDiff = np.append(pairWiseDiff, diffTime)
    return pairWiseDiff.reshape(N,N)

def runB(timestamps):
    N = len(timestamps)
    pairWiseDiff = np.zeros((N*N))
    for n,ts in enumerate(timestamps):
        diffTime = timestamps - ts
        #individualTimeDiff = diffTime[np.logical_and(diffTime<1000000.0, diffTime>-1000000.0)]
        j = slice( N*n,  N*(n+1) )
        pairWiseDiff[j] = diffTime
    return pairWiseDiff.reshape(N,N)

def runC(timestamps):
    N = len(timestamps)
    pairWiseDiff = np.zeros((N,N))
    Dif  =  timestamps -timestamps[0] 
    #iDif = diffTime[np.logical_and(Dif<1000000.0, Dif>-1000000.0)]
    for n in range(0,N):
        pairWiseDiff[n,n:N] =   Dif[0:N-n]
        pairWiseDiff[n,0:n] =  -Dif[n:0:-1]
    return pairWiseDiff

%timeit runA(timestamps); #print(pairWiseDiff); print()
%timeit runB(timestamps); #print(pairWiseDiff); print()
%timeit runC(timestamps); #print(pairWiseDiff); print()
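As a rough cross-check, the sketch below compares a loop-free broadcasting version with the shifting scheme used in runC. The function names run_broadcast and run_shift and the uneven sample values are assumptions for illustration. It suggests that the shifting scheme reproduces the full matrix only when the timestamps are evenly spaced.

```python
import numpy as np

def run_broadcast(timestamps):
    """Hypothetical loop-free variant: entry [n, m] is timestamps[m] - timestamps[n]."""
    timestamps = np.asarray(timestamps, dtype=float)
    return timestamps[None, :] - timestamps[:, None]

def run_shift(timestamps):
    """Same filling scheme as runC: every row is built from shifted copies of Dif."""
    N = len(timestamps)
    out = np.zeros((N, N))
    dif = timestamps - timestamps[0]
    for n in range(N):
        out[n, n:N] = dif[0:N - n]
        out[n, 0:n] = -dif[n:0:-1]
    return out

even = np.arange(1000) + 10                    # evenly spaced, as in the benchmark above
uneven = np.array([10.0, 11.0, 15.0, 16.5])    # made-up, unevenly spaced values

even_ok = np.array_equal(run_shift(even), run_broadcast(even))
uneven_ok = np.array_equal(run_shift(uneven), run_broadcast(uneven))
```

For real timestamps with irregular gaps, the broadcasting form (or a row-blocked version of it, if the full matrix does not fit in memory) is the safer choice.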