Question

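I have a dataset containing locations (coordinates) and a scalar attribute (e.g. temperature) for each location. I need to cluster the locations based on the scalar attribute, while taking the distance between locations into account.

The problem is that, taking temperature as an example, locations that are far apart can have the same temperature. If I cluster on temperature alone, those locations end up in the same cluster when they shouldn't. The opposite holds when two locations close to each other have different temperatures: clustering on temperature may put them in different clusters, while clustering on the distance matrix would put them in the same one.

So, is there a way to cluster the observations with the emphasis on one attribute (temperature) and then "refine" the result based on the distance matrix?

Here is a simple example showing how the clustering differs depending on whether the attribute or the distance matrix is used as the basis. My goal is to be able to use both the attribute and the distance matrix, giving more weight to the attribute.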

import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd

# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)

t = np.random.randint(0, 20, size=(100,1))

# Compute the pairwise distance matrix
# (pass scalars to haversine.distance rather than 1-element arrays)
D = np.zeros((len(x), len(x)))
for k in range(len(x)):
    for j in range(len(x)):
        D[k, j] = haversine.distance((x[k, 0], y[k, 0]), (x[j, 0], y[j, 0]))

# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")

# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)  
plt.show()

# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)  
plt.show()

haversine.py is available here: https://gist.github.com/rochacbruno/2883505
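
To make the goal concrete, here is a rough sketch of one possible way to use both sources of information at once: scale the attribute distances and the spatial distances to [0, 1], mix them with a weight, and run hierarchical clustering on the result. This is only an illustration under assumptions, not a known answer. The helper name `combined_clusters`, the weight `w_attr`, and the use of plain Euclidean distance in place of haversine are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def _scale(d):
    """Scale condensed distances to [0, 1]; leave them unchanged if all are zero."""
    d_max = d.max()
    return d / d_max if d_max > 0 else d


def combined_clusters(coords, attr, w_attr=0.7, n_clusters=5):
    """Hypothetical helper: cluster on a weighted mix of attribute and spatial distance."""
    # Condensed pairwise distances. Euclidean stands in for haversine here (assumption).
    d_space = _scale(pdist(coords))
    d_attr = _scale(pdist(np.asarray(attr, dtype=float).reshape(-1, 1)))
    # w_attr > 0.5 gives the attribute more influence than the location.
    d_mix = w_attr * d_attr + (1.0 - w_attr) * d_space
    Z = linkage(d_mix, method="complete")
    # Cut the tree so that at most n_clusters clusters are formed.
    return fcluster(Z, t=n_clusters, criterion="maxclust")


rng = np.random.default_rng(0)
coords = rng.random((100, 2))            # synthetic x, y positions
temps = rng.integers(0, 20, size=100)    # synthetic temperatures
labels = combined_clusters(coords, temps, w_attr=0.7, n_clusters=5)
```

With `w_attr=1.0` this reduces to clustering on temperature only, and with `w_attr=0.0` to clustering on location only, so the weight controls how much the distance matrix "refines" the attribute-based result.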

Thanks.

Clustering observations based first on an attribute and then on a distance matrix

0 answers: