I clustered a data sample with sklearn KMeans (400k samples, dimension = 205, 200 clusters).
For each cluster, I would like to know the maximum distance between the cluster center and the farthest sample of the cluster, to get an idea of the cluster's "size". Here is my code:
import numpy as np
import scipy.spatial.distance as spd
diam = np.empty([200])
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max()
"seed" holds the cluster centers (200 x 206). The first column of "seed" contains the number of samples in each cluster (not relevant here).
"data" holds the samples (400k x 206). The first column of "data" contains the cluster number.
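For context, here is a minimal sketch (mine, not part of the original question) of how arrays in this layout might be built from a fitted sklearn KMeans model; the names X and km are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 205)             # stand-in for the real 400k x 205 samples
km = KMeans(n_clusters=200).fit(X)
counts = np.bincount(km.labels_, minlength=200)

# seed: col 0 = samples per cluster, cols 1: = the 205 center coordinates
seed = np.column_stack([counts, km.cluster_centers_])
# data: col 0 = cluster label, cols 1: = the sample coordinates
# (column_stack upcasts col 0 to float; cast back with .astype(int) before using it as an index)
data = np.column_stack([km.labels_, X])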
Question: this is done with a loop (not very "numpy"). Can it be "vectorized"?
Answer 0 (score: 2)
We can be a bit smarter about the indexing and save a factor of about 4 in time.
First let's build some data of the right shape:
seed = np.random.randint(0, 100, (200,206))
data = np.random.randint(0, 100, (400000, 206))  # note: shapes must be ints, not 4e5
seed[:, 0] = np.arange(200)
data[:, 0] = np.random.randint(0, 200, 400000)
diam = np.empty(200)
Timing of the original answer:
%%timeit
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max()
1 loops, best of 3: 1.35 s per loop
moarningsun's answer:
%%timeit
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)
1 loops, best of 3: 1.33 s per loop
%%timeit
data_sorted = data[data[:, 0].argsort()]
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
diam_out = np.maximum.reduceat(dists,shift_idx)
1 loops, best of 3: 1.65 s per loop
As we can see, none of the vectorized solutions really gains us anything except a larger memory footprint. To avoid that, we need to go back to the original answer, which is really the right way to do these operations, and instead try to reduce the amount of indexing:
%%timeit
idx = data[:,0].argsort()
bins = np.bincount(data[:,0])
counter = 0
for i in range(200):
    data_slice = idx[counter: counter+bins[i]]
    diam[i] = spd.cdist(seed[None, i, 1:], data[data_slice, 1:]).max()
    counter += bins[i]
1 loops, best of 3: 281 ms per loop
Double-checking the answers:
np.allclose(diam, diam_out)
True
This is the problem with assuming that python loops are bad. They often are, but not in all situations.
Answer 1 (score: 1)
Here's a vectorized approach -
# Sort data w.r.t. col-0
data_sorted = data[data[:, 0].argsort()]
# Get counts of unique tags in col-0 of data and repeat seed accordingly.
# Thus, we would have an extended version of seed that matches data's shape.
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
# Get euclidean distances between extended seed version and sorted data
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
# Get positions of shifts in col-0 of sorted data
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
# Final piece of puzzle is to get tag based maximum values from dists,
# where each tag is unique number in col-0 of data
diam_out = np.maximum.reduceat(dists,shift_idx)
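As a small aside (toy arrays of my own, not from the answer), this is how the np.maximum.reduceat step behaves in isolation: it takes the maximum within each segment that starts at one of the given indices, which is exactly the per-tag maximum once the tags are sorted:

import numpy as np
dists = np.array([3., 7., 5., 2., 4., 6.])   # distances for the sorted tags below
tags = np.array([0, 0, 0, 1, 1, 2])
# segment starts: position 0 plus every place the tag changes
shift_idx = np.append(0, np.nonzero(np.diff(tags))[0] + 1)   # array([0, 3, 5])
np.maximum.reduceat(dists, shift_idx)                        # array([7., 4., 6.])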
Runtime tests and output verification -
Function definitions:
def loopy_cdist(seed, data): # from OP's solution
    N = seed.shape[0]
    diam = np.empty(N)
    for i in range(N):
        diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:, 1:]).max()
    return diam
def vectorized_repeat_reduceat(seed, data): # from this solution
    data_sorted = data[data[:, 0].argsort()]
    seed_ext = np.repeat(seed, np.bincount(data_sorted[:, 0]), axis=0)
    dists = np.sqrt(((data_sorted[:, 1:] - seed_ext[:, 1:])**2).sum(1))
    shift_idx = np.append(0, np.nonzero(np.diff(data_sorted[:, 0]))[0]+1)
    return np.maximum.reduceat(dists, shift_idx)
def vectorized_indexing_maxat(seed, data): # from @moarningsun's solution
    seed_repeated = seed[data[:, 0]]
    dist_to_center = np.sqrt(np.sum((data[:, 1:] - seed_repeated[:, 1:])**2, axis=1))
    diam = np.zeros(len(seed))
    np.maximum.at(diam, data[:, 0], dist_to_center)
    return diam
Verify outputs:
In [417]: # Inputs
...: seed = np.random.rand(20,20)
...: data = np.random.randint(0,20,(40000,20))
...:
In [418]: np.allclose(loopy_cdist(seed,data),vectorized_repeat_reduceat(seed,data))
Out[418]: True
In [419]: np.allclose(loopy_cdist(seed,data),vectorized_indexing_maxat(seed,data))
Out[419]: True
Runtimes:
In [420]: %timeit loopy_cdist(seed,data)
10 loops, best of 3: 35.9 ms per loop
In [421]: %timeit vectorized_repeat_reduceat(seed,data)
10 loops, best of 3: 28.9 ms per loop
In [422]: %timeit vectorized_indexing_maxat(seed,data)
10 loops, best of 3: 24.1 ms per loop
Answer 2 (score: 1)
Very similar to @Divakar's, but without the need to sort:
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)
It is well known that ufunc.at is slow, however, so it would be interesting to see which one is faster.
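A minimal benchmark sketch for that comparison (assuming the two functions defined in Answer 1 are in scope; the actual numbers depend on the machine):

import timeit
import numpy as np

seed = np.random.rand(200, 206)
data = np.random.randint(0, 200, (400000, 206))   # col 0 doubles as a valid label in [0, 200)

for fn in (vectorized_indexing_maxat, vectorized_repeat_reduceat):
    t = timeit.timeit(lambda: fn(seed, data), number=5) / 5
    print(fn.__name__, '%.3f s' % t)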