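I have the following sample geographic data with 100 rows. I want to cluster these POIs into 10 groups of 10 points each and, if possible, have all the points in each group come from the same area.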
id areas lng lat
1010094160 A 116.31967 40.03229
1010737675 A 116.28941 40.03968
1010724217 A 116.32256 40.048
1010122181 A 116.28683 40.09652
1010732739 A 116.33482 40.06456
1010730289 A 116.3724 40.04066
1010737817 A 116.24174 40.074
1010124109 A 116.2558 40.08371
1010732695 B 116.31591 40.07096
1010112361 B 116.33331 39.96539
1010042095 B 116.31283 39.98804
1010097579 B 116.37637 39.98865
1010110203 B 116.41351 40.00851
1010085120 B 116.41364 39.98069
1010310183 B 116.42757 40.03738
1010087029 B 116.38947 39.97715
1010737155 B 116.38391 39.9849
1010729305 B 116.37803 40.04512
1010085100 B 116.37679 39.98838
1010750159 B 116.32162 39.98518
1010061742 B 116.31618 39.99087
1010091848 B 116.37617 39.97739
1010104343 C 116.3295 39.98156
1010091704 C 116.37236 39.9943
1010086652 C 116.36102 39.92978
1010030017 C 116.39017 39.99287
1010091851 C 116.35854 40.0063
1010705229 C 116.39114 39.97511
1010107321 C 116.42535 39.95417
1010130423 C 116.31651 40.04164
1010126133 C 116.29051 40.05081
1010177543 C 116.41114 39.99635
1010123271 C 116.35923 40.02031
1010315589 C 116.33906 39.99895
This is the expected result:
id areas lng lat clusterNumber
1010094160 A 116.31967 40.03229 0
1010737675 A 116.28941 40.03968 0
1010724217 A 116.32256 40.048 0
1010122181 A 116.28683 40.09652 0
1010732739 A 116.33482 40.06456 0
1010730289 A 116.3724 40.04066 0
1010737817 A 116.24174 40.074 0
1010124109 A 116.2558 40.08371 0
1010732695 B 116.31591 40.07096 0
1010112361 B 116.33331 39.96539 1
1010042095 B 116.31283 39.98804 1
1010097579 B 116.37637 39.98865 1
1010110203 B 116.41351 40.00851 1
1010085120 B 116.41364 39.98069 1
1010310183 B 116.42757 40.03738 1
1010087029 B 116.38947 39.97715 1
1010737155 B 116.38391 39.9849 1
1010729305 B 116.37803 40.04512 1
1010085100 B 116.37679 39.98838 1
1010750159 B 116.32162 39.98518 2
1010061742 B 116.31618 39.99087 2
1010091848 B 116.37617 39.97739 2
1010104343 C 116.3295 39.98156 2
1010091704 C 116.37236 39.9943 2
1010086652 C 116.36102 39.92978 2
1010030017 C 116.39017 39.99287 2
1010091851 C 116.35854 40.0063 2
1010705229 C 116.39114 39.97511 2
1010107321 C 116.42535 39.95417 2
1010130423 C 116.31651 40.04164 3
1010126133 C 116.29051 40.05081 3
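Reading the expected table, the labels look roughly like what you would get by ordering the rows by area and cutting them into consecutive blocks. A minimal sketch of that idea follows; the function name `assign_sequential_clusters` and the `group_size` parameter are hypothetical, and this approach ignores geographic distance entirely, so it only addresses the "same area" preference. The example uses six rows from the sample data with a block size of 2 to keep it short.

```python
def assign_sequential_clusters(rows, group_size=10):
    """Order rows by area, then label consecutive blocks of `group_size` rows.

    Python's sort is stable, so rows keep their original order within an area.
    """
    ordered = sorted(rows, key=lambda r: r["areas"])
    return [dict(r, clusterNumber=i // group_size) for i, r in enumerate(ordered)]


# Six rows taken from the sample data above, deliberately out of order.
rows = [
    {"id": 1010112361, "areas": "B", "lng": 116.33331, "lat": 39.96539},
    {"id": 1010094160, "areas": "A", "lng": 116.31967, "lat": 40.03229},
    {"id": 1010104343, "areas": "C", "lng": 116.3295, "lat": 39.98156},
    {"id": 1010737675, "areas": "A", "lng": 116.28941, "lat": 40.03968},
    {"id": 1010732695, "areas": "B", "lng": 116.31591, "lat": 40.07096},
    {"id": 1010091704, "areas": "C", "lng": 116.37236, "lat": 39.9943},
]

labeled = assign_sequential_clusters(rows, group_size=2)
for r in labeled:
    print(r["id"], r["areas"], r["lng"], r["lat"], r["clusterNumber"])
```

When an area's row count is not a multiple of the block size, the leftover rows share a block with the next area, which is the same spill-over visible in the expected table.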
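I have already tried K-means, but I cannot keep every group the same size. Is there another, better method I can use in Python? Please share your ideas and hints. Thanks.
This is what I have tried: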
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# `result` is the iterable of input rows, each with 'lng' and 'lat' fields
X = []
for row in result:
    X.append([float(row['lng']), float(row['lat'])])
X = np.array(X)

n_clusters = 10  # 10 groups is the goal stated above
cls = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
#cls = EqualGroupsKMeans(n_clusters, random_state=0).fit(X)
#km1 = KMeans(n_clusters=6, n_init=25, max_iter = 600, random_state=0)
print(cls.labels_)

markers = ['^','x','o','*','+', '+']
colors = ['b', 'c', 'g', 'k', 'm', 'r']
for i in range(n_clusters):
    members = cls.labels_ == i
    # Number of points in cluster i; plain KMeans does not keep these equal
    print(len(X[members,0]))
    #plt.scatter(X[members,0],X[members,1],s=6,marker=markers[i],c=colors[i],alpha=0.5)
    plt.scatter(X[members,0],X[members,1],s=6,marker="^",c=colors[i%6],alpha=0.5)
plt.title(' ')
plt.show()
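For comparison, here is a rough sketch of one way to force equal group sizes: a k-means-style loop in which each cluster has a fixed capacity and points are handed out greedily, nearest pair first. This is only an illustrative heuristic written with the standard library, not the algorithm from any particular package; the function name `equal_size_kmeans` and its parameters are made up for this example, the points are synthetic rather than the real POIs, and plain Euclidean distance on lng/lat degrees is an assumption that is only approximate for geographic data.

```python
import math
import random
from collections import Counter


def equal_size_kmeans(points, n_clusters, n_iter=20, seed=0):
    """Heuristic k-means variant: each cluster holds at most ceil(n / k) points."""
    rng = random.Random(seed)
    capacity = math.ceil(len(points) / n_clusters)
    centers = rng.sample(points, n_clusters)  # initial centers: random points
    labels = [-1] * len(points)
    for _ in range(n_iter):
        # Every (distance, point index, cluster index) pair, nearest first.
        pairs = sorted(
            (math.dist(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centers)
        )
        new_labels = [-1] * len(points)
        counts = [0] * n_clusters
        # Greedy assignment: take the closest pair whose cluster still has room.
        for _, i, j in pairs:
            if new_labels[i] == -1 and counts[j] < capacity:
                new_labels[i] = j
                counts[j] += 1
        if new_labels == labels:
            break  # assignment stopped changing
        labels = new_labels
        # Move each center to the mean of its current members.
        for j in range(n_clusters):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Synthetic (lng, lat) points in roughly the same range as the sample data.
rng = random.Random(1)
points = [(116.2 + 0.25 * rng.random(), 39.9 + 0.2 * rng.random()) for _ in range(100)]
labels = equal_size_kmeans(points, n_clusters=10)
print(sorted(Counter(labels).values()))
```

Because the total capacity covers every point, no point is left unassigned, and when the number of points is a multiple of `n_clusters` every cluster ends up with exactly the same size. The greedy step is not optimal in total distance; an exact assignment solver could replace it at higher cost.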
This is the Same-Size-K-Means reference I found on GitHub: