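I'm new to Python. I'm using DBSCAN code for clustering, with a few changes of my own. The code now runs correctly, but it is very slow. I found out that I have to remove the 'for loop' from my code. Here is part of the code: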
from math import radians, sin, cos, sqrt, atan2

class Point:
    def __init__(self, x = 0, y = 0, visited = False, isnoise = False):
        self.x = x
        self.y = y
        self.visited = visited
        self.isnoise = isnoise
    def show(self):
        return self.x, self.y
    def dist(self, p1, p2):
        # Calculate the great circle distance (in km) between two points
        # on the earth (specified in decimal degrees)
        # convert decimal degrees to radians
        dlat = radians(p2.x-p1.x)
        dlon = radians(p2.y-p1.y)
        a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1.x))* cos(radians(p2.x)) * sin(dlon/2) * sin(dlon/2)
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        d = 6371 * c
        return d
    def distanceQuery(self,neighbor_pts):
        # largest distance over all pairs (i, j) with i < j
        dista=[]
        for i in range(len(neighbor_pts)):
            for j in range(i+1,len(neighbor_pts)):
                z=self.dist(neighbor_pts[i],neighbor_pts[j])
                dista.append(z)
        return max(dista)
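The distanceQuery function uses a double for loop. Is there any way to remove it? Can I vectorize this double for loop? Since this is clustering code, some steps require appending, and I have read that numpy arrays work differently from Python lists in this respect: appending to a numpy array is inefficient.
So distanceQuery can be vectorized. But here is another part of the code, where the append happens after I check some conditions.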
def expandCluster(self, P, neighbor_points):
    self.cluster[self.cluster_inx].append(P)
    iterator = iter(neighbor_points)
    while True:
        try:
            npoint_tmp = iterator.next()
        except StopIteration:
            # StopIteration exception is raised after last element
            break
        if (not npoint_tmp.visited):
            #for each point P' in NeighborPts
            npoint_tmp.visited = True
            NeighborPts_ = self.regionQuery(npoint_tmp)
            if (len(NeighborPts_) >= self.MinPts):
                for j in range(len(NeighborPts_)):
                    neighbor_points.append(NeighborPts_[j])
                    if self.distanceQuery(neighbor_points)>0.10:
                        break
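Now, if I vectorize neighbor_points as well, will I have to solve the append problem? Each point is appended to neighbor_points, and then distanceQuery is run on the result. This process is itself part of an iteration, so there are two loops here as well. I just want to make sure that appending to a numpy array will not be inefficient.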
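The append concern above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the original code: `GrowingCluster` and `haversine_to_all` are hypothetical names, and points are assumed to be (lat, lon) pairs in degrees. Points are collected in a plain Python list, where append is cheap, and only the distances from the newly appended point to the existing ones are computed, since no other pair can change the maximum.

```python
import numpy as np

EARTH_RADIUS_KM = 6371

def haversine_to_all(p, pts):
    """Great-circle distance (km) from one (lat, lon) point to every row of pts."""
    lat1, lon1 = np.radians(p)
    lat2 = np.radians(pts[:, 0])
    lon2 = np.radians(pts[:, 1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

class GrowingCluster:
    """Hypothetical helper: collects points and tracks the largest pairwise distance."""

    def __init__(self):
        self.points = []     # plain list, so append is amortized O(1)
        self.max_dist = 0.0  # largest pairwise distance seen so far

    def append(self, p):
        if self.points:
            # only pairs involving the new point can raise the maximum
            d = haversine_to_all(p, np.asarray(self.points, dtype=float))
            self.max_dist = max(self.max_dist, float(d.max()))
        self.points.append(tuple(p))
```

With this layout, the check after each append becomes a comparison against `max_dist` instead of a full recomputation over all pairs.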
Answer 0 (score: 5)
import numpy as np
# get the vectorized versions of the math functions
from numpy import sin, cos, radians, sqrt, arctan2 as atan2

def dist(p1, p2):
    # Initially, p1.shape == (n, 2) and p2.shape == (m, 2)
    p1 = p1[np.newaxis, :, :]
    p2 = p2[:, np.newaxis, :]
    # Now, p1.shape == (1, n, 2) and p2.shape == (m, 1, 2)
    # do the same math as before, but use `p[..., 0]` instead of `p.x` etc
    dlat = radians(p2[..., 0] - p1[..., 0])
    dlon = radians(p2[..., 1] - p1[..., 1])
    # the latitudes are in degrees, so convert them before taking the cosine
    a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1[..., 0]))*cos(radians(p2[..., 0])) * sin(dlon/2) * sin(dlon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = 6371 * c
    # d.shape == (m, n); d[i, j] is the distance between p2[i] and p1[j]
    return d

def distanceQuery(neighbor_pts):
    return np.max(dist(neighbor_pts, neighbor_pts))
For example:
>>> points = np.array([[0, 0], [45, 0], [45, 45], [90, 0]], dtype=float)
>>> np.round(dist(points, points), 1)
array([[    0. ,  5003.8,  6671.7, 10007.5],
       [ 5003.8,     0. ,  3491.5,  5003.8],
       [ 6671.7,  3491.5,     0. ,  5003.8],
       [10007.5,  5003.8,  5003.8,     0. ]])
>>> dist(points, points).max()
10007.543398010286
import itertools

def dist_slow(p1, p2):
    """your function, adjusted to take an array instead of a `Point`"""
    from math import radians, sin, cos, sqrt, atan2
    # compute the distance for a single pair of points
    dlat = radians(p2[0]-p1[0])
    dlon = radians(p2[1]-p1[1])
    a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1[0]))*cos(radians(p2[0])) * sin(dlon/2) * sin(dlon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = 6371 * c
    return d

def query_iter(p):
    return max(dist_slow(p1, p2) for p1, p2 in itertools.combinations(p, 2))

def query_orig(p):
    dista=[]
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            z = dist_slow(p[i], p[j])
            dista.append(z)
    return max(dista)

def query_mine(p):
    return dist(p, p).max()
Then:
>>> points = np.random.rand(1000, 2)
>>> timeit query_orig(points)
1 loops, best of 3: 7.94 s per loop
>>> timeit query_iter(points)
1 loops, best of 3: 7.35 s per loop
>>> timeit query_mine(points)
10 loops, best of 3: 150 ms per loop
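The full matrix returned by dist(p, p) evaluates every pair twice and also fills the zero diagonal. A minimal sketch of a variant that evaluates only the pairs with i < j, using `np.triu_indices`, is shown below. It assumes the same (lat, lon) layout in degrees; `max_pairwise_haversine` is a hypothetical name and was not part of the timings above.

```python
import numpy as np

def max_pairwise_haversine(pts, radius=6371):
    """Largest great-circle distance among the rows of pts, given as (lat, lon) in degrees.

    Needs at least two points; with fewer, max() of an empty array raises ValueError.
    """
    pts = np.radians(np.asarray(pts, dtype=float))
    i, j = np.triu_indices(len(pts), k=1)  # all index pairs with i < j
    dlat = pts[j, 0] - pts[i, 0]
    dlon = pts[j, 1] - pts[i, 1]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(pts[i, 0]) * np.cos(pts[j, 0]) * np.sin(dlon / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return (2 * radius * np.arctan2(np.sqrt(a), np.sqrt(1 - a))).max()
```

This roughly halves the arithmetic and the temporary memory, at the cost of building two index arrays of length n(n-1)/2, so whether it is actually faster would need to be measured.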
Answer 1 (score: 4)
You can do everything in "vector" form with numpy ufuncs:
from numpy import radians, sin, cos, sqrt, arctan2
from numpy import random

def max_dist(p1x,p1y,p2x,p2y):
    # give them "orthogonal" shapes: p1 as a column (n, 1), p2 as a row (1, m)
    p1x = p1x.reshape(p1x.size,1)
    p1y = p1y.reshape(p1y.size,1)
    p2x = p2x.reshape(1,p2x.size)
    p2y = p2y.reshape(1,p2y.size)
    # compute the distance for all possible pairs; broadcasting yields (n, m) arrays
    dlat = radians(p2x-p1x)
    dlon = radians(p2y-p1y)
    a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1x))*cos(radians(p2x)) * sin(dlon/2) * sin(dlon/2)
    c = 2 * arctan2(sqrt(a), sqrt(1-a))
    d = 6371 * c
    return d.max()

if __name__=='__main__':
    # generate random samples
    N = 1000
    p1x,p1y,p2x,p2y = random.rand(4,N)
    print('max_dist = {}'.format(max_dist(p1x,p1y,p2x,p2y)))
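The "orthogonal" reshaping relies on numpy broadcasting: combining a column of shape (n, 1) with a row of shape (1, m) produces an (n, m) array holding every pairing. A small sketch with made-up values (the arrays `a` and `b` are purely illustrative):

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0]).reshape(3, 1)  # column, shape (3, 1)
b = np.array([1.0, 2.0]).reshape(1, 2)         # row, shape (1, 2)

# broadcasting stretches both operands to shape (3, 2);
# diff[i, j] == b[0, j] - a[i, 0]
diff = b - a
print(diff.shape)
```

Every ufunc applied afterwards (radians, sin, cos, and so on) works elementwise on that (n, m) array, which is what replaces the two nested loops.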
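Answer 2 (score: 2)
I'm not sure about vectorization, but you can certainly turn the double for loop into a list comprehension. Since you only need the maximum of that list, you can also use a generator expression.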
def distGen(pts):
    return max(dist(pts[i], pts[j]) for i in range(len(pts))
               for j in range(i+1, len(pts)))
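I did some timing analysis on this, and it seems to be at least a bit faster. Interestingly, and contrary to my intuition, using a list comprehension instead of a generator is faster still, but the generator should have the advantage of using less memory.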
1.15502595901 # your approach
1.37675499916 # your approach single max value var instead of list
1.00971293449 # above generator expression
0.916918992996 # above with list comprehension, i.e., max([...])
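(Tested with 1000 random numbers instead of points, and with dist measuring the absolute distance between those numbers.)
Even better, both cleaner and a bit faster, is to use itertools.combinations to get all combinations of two points: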
import itertools

def distComb(pts):
    return max(dist(p1, p2) for p1, p2 in itertools.combinations(pts, 2))
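To make the behaviour of itertools.combinations concrete, here is a small sketch using plain numbers and an absolute-difference `dist`, mirroring the test setup described above. The names `dist_comb` and `pairs` and the sample values are only for illustration:

```python
import itertools

def dist(a, b):
    # stand-in distance for the illustration: absolute difference of two numbers
    return abs(a - b)

def dist_comb(pts):
    # combinations(pts, 2) yields each unordered pair exactly once, in input order
    return max(dist(p1, p2) for p1, p2 in itertools.combinations(pts, 2))

pairs = list(itertools.combinations([1, 5, 2], 2))
print(pairs)                 # [(1, 5), (1, 2), (5, 2)]
print(dist_comb([1, 5, 2]))  # 4
```

The pairs produced are the same ones the nested `range(i+1, len(pts))` loops visit, so the result matches the original distanceQuery.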
Answer 3 (score: 2)
Here is another solution, which first maps all of your points onto the unit sphere:
import numpy as np
import scipy.spatial

def sphereify(points):
    """lat, long -> x, y, z for a unit sphere"""
    lat = np.radians(points[:, 0, np.newaxis])
    lon = np.radians(points[:, 1, np.newaxis])
    return np.hstack((
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat)
    ))

def arcDistance(chordDistance):
    """Get the surface distance corresponding to the chord distance"""
    return np.arcsin(chordDistance / 2) * 2

earthRadius = 6371

def query(points):
    # pdist returns the Euclidean (chord) distance for every pair of points
    dists = scipy.spatial.distance.pdist(sphereify(points))
    # arcsin is increasing, so the longest chord gives the longest arc
    surfaceDist = earthRadius * arcDistance(dists.max())
    return surfaceDist
Then:
>>> timeit query(points)
100 loops, best of 3: 6.23 ms per loop
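The chord-to-arc conversion follows from the geometry of the unit sphere: two points separated by a central angle θ are joined by a chord of length 2·sin(θ/2), so θ = 2·arcsin(chord/2). A minimal numpy-only sketch that checks this against the haversine formula for one pair of points; the helper names `to_unit_sphere` and `haversine_angle` and the sample coordinates are illustrative assumptions:

```python
import numpy as np

def to_unit_sphere(lat_deg, lon_deg):
    """(lat, lon) in degrees -> (x, y, z) on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def haversine_angle(lat1, lon1, lat2, lon2):
    """Central angle in radians between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

p, q = (45.0, 0.0), (45.0, 45.0)
chord = np.linalg.norm(to_unit_sphere(*p) - to_unit_sphere(*q))
angle_from_chord = 2 * np.arcsin(chord / 2)
angle_from_haversine = haversine_angle(*p, *q)
print(np.isclose(angle_from_chord, angle_from_haversine))
```

Multiplying either angle by the earth radius of 6371 km gives the surface distance, so both routes should agree up to floating-point rounding.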