I have a set of 3D points:
a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])
How can I compute the geometric median of these points?
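Answer 0 (score: 17)
I implemented Yehuda Vardi and Cun-Hui Zhang's algorithm for the geometric median, described in their paper "The multivariate L1-median and associated data depth". Everything is vectorized in numpy, so it should be very fast. I did not implement weights, only unweighted points.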
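import numpy as np
from scipy.spatial.distance import cdist, euclidean

def geometric_median(X, eps=1e-5):
    # start from the centroid
    y = np.mean(X, 0)
    while True:
        # distances from every point to the current estimate
        D = cdist(X, [y])
        # ignore points that coincide with the estimate
        nonzeros = (D != 0)[:, 0]
        # Weiszfeld step over the remaining points
        Dinv = 1 / D[nonzeros]
        Dinvs = np.sum(Dinv)
        W = Dinv / Dinvs
        T = np.sum(W * X[nonzeros], 0)
        num_zeros = len(X) - np.sum(nonzeros)
        if num_zeros == 0:
            y1 = T
        elif num_zeros == len(X):
            return y
        else:
            # the estimate coincides with data point(s): blend T and y
            R = (T - y) * Dinvs
            r = np.linalg.norm(R)
            rinv = 0 if r == 0 else num_zeros/r
            y1 = max(0, 1-rinv)*T + min(1, rinv)*y
        # stop once the estimate moves by less than eps
        if euclidean(y, y1) < eps:
            return y1
        y = y1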
In addition to the default SO license terms, I also release the code above under the zlib license, if you prefer.
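For context, the geometric median is the point that minimizes the sum of Euclidean distances to the data. The sketch below is not part of the implementation above: `sum_of_distances` is a hypothetical helper, and the candidate median is the value reported in this thread for the array from the question, hard-coded rather than recomputed. It only compares the objective at the centroid with the objective at that candidate.

```python
import numpy as np

# Points from the question.
a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])

def sum_of_distances(X, y):
    """Sum of Euclidean distances from y to every row of X (hypothetical helper)."""
    return np.linalg.norm(X - y, axis=1).sum()

centroid = a.mean(axis=0)
# Value reported in this thread for geometric_median(a); hard-coded, not recomputed.
candidate = np.array([12.58942481, 3.51573852, 7.28710661])

cost_centroid = sum_of_distances(a, centroid)
cost_candidate = sum_of_distances(a, candidate)
print(cost_centroid, cost_candidate)
```

The candidate should give the smaller of the two values, since the centroid minimizes the sum of squared distances rather than the sum of distances.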
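Answer 1 (score: 6)
The geometric median can be computed with Weiszfeld's iterative algorithm, as implemented in a Python gist, or in the function below, copied from the OpenAlea software (CeCILL-C license).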
import numpy as np
import math
import warnings

def geometric_median(X, numIter = 200):
    """
    Compute the geometric median of a point sample.
    The geometric median coordinates will be expressed in the Spatial Image reference system (not in real world metrics).
    We use the Weiszfeld's algorithm (http://en.wikipedia.org/wiki/Geometric_median)
    :Parameters:
    - `X` (list|np.array) - voxels coordinate (3xN matrix)
    - `numIter` (int) - limit the length of the search for global optimum
    :Return:
    - np.array((x,y,z)): geometric median of the coordinates;
    """
    # -- Initialising 'median' to the centroid
    y = np.mean(X,1)
    # -- If the init point is in the set of points, we shift it:
    while (y[0] in X[0]) and (y[1] in X[1]) and (y[2] in X[2]):
        y+=0.1
    convergence=False # boolean testing the convergence toward a global optimum
    dist=[] # list recording the distance evolution
    # -- Weiszfeld iterations: minimise the sum of the distances between the points in 'X' and the median.
    i=0
    while ( (not convergence) and (i < numIter) ):
        num_x, num_y, num_z = 0.0, 0.0, 0.0
        denum = 0.0
        m = X.shape[1]
        d = 0
        for j in range(0,m):
            div = math.sqrt( (X[0,j]-y[0])**2 + (X[1,j]-y[1])**2 + (X[2,j]-y[2])**2 )
            num_x += X[0,j] / div
            num_y += X[1,j] / div
            num_z += X[2,j] / div
            denum += 1./div
            d += div**2 # sum of squared distances, tracked only for the convergence test
        dist.append(d) # update of the distance evolution
        if denum == 0.:
            warnings.warn( "Couldn't compute a geometric median, please check your data!" )
            return [0,0,0]
        y = [num_x/denum, num_y/denum, num_z/denum] # update to the new value of the median
        if i > 3:
            convergence=(abs(dist[i]-dist[i-2])<0.1) # we test the convergence over three steps for stability
        i += 1
    if i == numIter:
        raise ValueError( "Weiszfeld's algorithm did not converge after " + str(numIter) + " iterations" )
    # -- When convergence is reached we assume that we found the median.
    return np.array(y)
Alternatively, you can use the C implementation mentioned in this answer and interface it with Python, for example via ctypes.
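Note that the function above expects a 3xN array, so the Nx3 array from the question would have to be passed transposed (`a.T`). As a rough illustration of the same Weiszfeld update in vectorized numpy, here is a minimal sketch that takes Nx3 rows directly. The name `weiszfeld_median` and the parameters `tol` and `max_iter` are assumptions made for this example, not part of OpenAlea. Points that coincide with the current estimate are simply skipped, which is a simplification and not a rigorous treatment of that case.

```python
import numpy as np

def weiszfeld_median(points, tol=1e-9, max_iter=1000):
    """Plain Weiszfeld iteration on an (N, 3) array (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    y = points.mean(axis=0)  # start from the centroid
    for _ in range(max_iter):
        dist = np.linalg.norm(points - y, axis=1)
        mask = dist > 0  # skip points that coincide with the estimate
        if not mask.any():
            return y  # every point equals the estimate
        weights = 1.0 / dist[mask]
        # Weighted average of the points, with weights 1 / distance.
        y_new = (weights[:, None] * points[mask]).sum(axis=0) / weights.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y  # iteration limit reached; return the last estimate

# Points from the question.
a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])
median = weiszfeld_median(a)
print(median)
```

Stopping on the size of the step, rather than on the change in the summed squared distances, is a design choice of this sketch and keeps the tolerance in the same units as the coordinates.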
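Answer 2 (score: 1)
The problem can easily be approximated with the minimize function from scipy.optimize. It offers a range of optimization algorithms, from Nelder-Mead to Newton-CG. Nelder-Mead is particularly useful if you do not want to bother with derivatives, at the cost of some accuracy: all it needs is the function to be minimized.
Now, taking the same array as in the question, @orlp's method gives: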
geometric_median(a)
# array([12.58942481, 3.51573852, 7.28710661])
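For the Nelder-Mead method, see below. The function to be minimized is the sum of the distances from the candidate point $(x_0, y_0, z_0)$ to all points $(x_i, y_i, z_i)$, i.e.
$$f(x_0, y_0, z_0) = \sum_{i=1}^{n} \sqrt{(x_0 - x_i)^2 + (y_0 - y_i)^2 + (z_0 - z_i)^2}$$
Here is the code: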
import numpy as np
from scipy.optimize import minimize

# a is the array from the question
x = [point[0] for point in a]
y = [point[1] for point in a]
z = [point[2] for point in a]
# centroid of the points, used as the starting value
x0 = np.array([sum(x)/len(x),sum(y)/len(y), sum(z)/len(z)])

def dist_func(x0):
    return sum(((np.full(len(x),x0[0])-x)**2+(np.full(len(x),x0[1])-y)**2+(np.full(len(x),x0[2])-z)**2)**(1/2))

# 'xatol' is the current name of the Nelder-Mead tolerance option (formerly 'xtol')
res = minimize(dist_func, x0, method='nelder-mead', options={'xatol': 1e-8, 'disp': True})
res.x
# array([12.58942487, 3.51573846, 7.28710679])
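Note that I use the mean of all points as the initial value for the algorithm. The result is very close to that of @orlp's method, which is more accurate. As I mentioned, you sacrifice a little accuracy but still get a pretty good approximation.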
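One way to sanity-check both results is the first-order optimality condition: when the geometric median does not coincide with a data point, the unit vectors pointing from the data points to it sum to zero. The sketch below evaluates that sum for the two values reported above. `objective_gradient` is a hypothetical helper, and the two vectors are hard-coded from this thread rather than recomputed.

```python
import numpy as np

# Points from the question.
a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])

def objective_gradient(X, y):
    """Gradient of the sum of distances at y (hypothetical helper).

    Only valid when y does not coincide with any row of X.
    """
    diff = y - X
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    return (diff / dist).sum(axis=0)

# Results reported in this thread, hard-coded for the check.
reported = {
    "geometric_median": np.array([12.58942481, 3.51573852, 7.28710661]),
    "nelder-mead": np.array([12.58942487, 3.51573846, 7.28710679]),
}

grad_norms = {name: np.linalg.norm(objective_gradient(a, y))
              for name, y in reported.items()}
for name, norm in grad_norms.items():
    print(name, norm)
```

A gradient norm close to zero indicates that the point is close to the minimizer; the smaller the norm, the closer it is.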
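Performance of the Nelder-Mead algorithm
For this, I generated a test_array containing 10000 points drawn from a normal distribution with mean 3.2 and standard deviation 20 in each coordinate. The geometric median should therefore be close to [3.2, 3.2, 3.2].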
np.random.seed(3)
test_array = np.array([[np.random.normal(3.2,20),
np.random.normal(3.2,20),
np.random.normal(3.2,20)] for i in np.arange(10000)])
For @orlp's method,
%timeit geometric_median(test_array)
# 12.1 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# array([2.95151061, 3.14098477, 3.01468281])
For Nelder-Mead,
%timeit res.x
# 565 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# array([2.95150898, 3.14098468, 3.01468276])
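Outside IPython, the cost of a whole Nelder-Mead run can be measured with the standard-library timeit module. Timing `res.x` alone would only measure an attribute lookup, so the sketch below times the `minimize` call itself. The names `points`, `total_distance` and `run_nelder_mead` are made up for this example, and the sample is smaller than the 10000-point array above so that it runs quickly; the timings it prints are not comparable to the figures quoted here.

```python
import timeit

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
# 2000 points, normal with mean 3.2 and standard deviation 20 per coordinate.
points = rng.normal(loc=3.2, scale=20.0, size=(2000, 3))

def total_distance(p):
    """Sum of Euclidean distances from p to every sample point."""
    return np.linalg.norm(points - p, axis=1).sum()

def run_nelder_mead():
    # Start from the centroid, as in the answer above.
    return minimize(total_distance, points.mean(axis=0),
                    method='Nelder-Mead', options={'xatol': 1e-8})

runs = 3
elapsed = timeit.timeit(run_nelder_mead, number=runs) / runs
print(f"{elapsed * 1e3:.1f} ms per call")
```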
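@orlp's method is very fast, while Nelder-Mead is not bad either. However, Nelder-Mead is a general-purpose method, whereas @orlp's is specific to the geometric median. Which one to choose depends on your purpose. If you only want an approximation, I would go with Nelder-Mead. If you want an accurate result, @orlp's method is both faster and more accurate.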