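Clustering geographic data into equal-sized groups with K-means in Python

Time: 2019-01-08 05:23:01

Tags: python pandas dataframe k-means

I have the following sample geographic data with 100 rows. I want to cluster these POIs into 10 groups of 10 points each and, if possible, have the points in each group come from the same area.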

    id    areas    lng        lat
1010094160  A   116.31967   40.03229
1010737675  A   116.28941   40.03968
1010724217  A   116.32256   40.048
1010122181  A   116.28683   40.09652
1010732739  A   116.33482   40.06456
1010730289  A   116.3724    40.04066
1010737817  A   116.24174   40.074
1010124109  A   116.2558    40.08371
1010732695  B   116.31591   40.07096
1010112361  B   116.33331   39.96539
1010042095  B   116.31283   39.98804
1010097579  B   116.37637   39.98865
1010110203  B   116.41351   40.00851
1010085120  B   116.41364   39.98069
1010310183  B   116.42757   40.03738
1010087029  B   116.38947   39.97715
1010737155  B   116.38391   39.9849
1010729305  B   116.37803   40.04512
1010085100  B   116.37679   39.98838
1010750159  B   116.32162   39.98518
1010061742  B   116.31618   39.99087
1010091848  B   116.37617   39.97739
1010104343  C   116.3295    39.98156
1010091704  C   116.37236   39.9943
1010086652  C   116.36102   39.92978
1010030017  C   116.39017   39.99287
1010091851  C   116.35854   40.0063
1010705229  C   116.39114   39.97511
1010107321  C   116.42535   39.95417
1010130423  C   116.31651   40.04164
1010126133  C   116.29051   40.05081
1010177543  C   116.41114   39.99635
1010123271  C   116.35923   40.02031
1010315589  C   116.33906   39.99895

This is the expected result:

   id     areas    lng         lat  clusterNumber
1010094160  A   116.31967   40.03229    0
1010737675  A   116.28941   40.03968    0
1010724217  A   116.32256   40.048      0
1010122181  A   116.28683   40.09652    0
1010732739  A   116.33482   40.06456    0
1010730289  A   116.3724    40.04066    0
1010737817  A   116.24174   40.074      0
1010124109  A   116.2558    40.08371    0
1010732695  B   116.31591   40.07096    0
1010112361  B   116.33331   39.96539    1
1010042095  B   116.31283   39.98804    1
1010097579  B   116.37637   39.98865    1
1010110203  B   116.41351   40.00851    1
1010085120  B   116.41364   39.98069    1
1010310183  B   116.42757   40.03738    1
1010087029  B   116.38947   39.97715    1
1010737155  B   116.38391   39.9849     1
1010729305  B   116.37803   40.04512    1
1010085100  B   116.37679   39.98838    1
1010750159  B   116.32162   39.98518    2
1010061742  B   116.31618   39.99087    2
1010091848  B   116.37617   39.97739    2
1010104343  C   116.3295    39.98156    2
1010091704  C   116.37236   39.9943     2
1010086652  C   116.36102   39.92978    2
1010030017  C   116.39017   39.99287    2
1010091851  C   116.35854   40.0063     2
1010705229  C   116.39114   39.97511    2
1010107321  C   116.42535   39.95417    2
1010130423  C   116.31651   40.04164    3
1010126133  C   116.29051   40.05081    3

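I have tried K-means, but I cannot keep every group the same size. Is there a better method I could use in Python? Please share your ideas and hints. Thanks.
This is what I have tried: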

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# `result` is an iterable of rows, each with 'lng' and 'lat' fields
X = np.array([[float(row['lng']), float(row['lat'])] for row in result])

n_clusters = 10  # 10 groups, as described above
cls = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
#cls = EqualGroupsKMeans(n_clusters, random_state=0).fit(X)

colors = ['b', 'c', 'g', 'k', 'm', 'r']
for i in range(n_clusters):
    members = cls.labels_ == i
    # Size of cluster i; plain K-means does not keep these equal
    print(members.sum())
    plt.scatter(X[members, 0], X[members, 1], s=6, marker="^", c=colors[i % 6], alpha=0.5)
plt.title(' ')
plt.show()

This is the Same-Size-K-Means reference I found on GitHub:

https://github.com/ndanielsen/Same-Size-K-Means
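
For illustration, here is a minimal sketch of one possible way to force equal group sizes. It is not the method used in the repository above. The idea is to run ordinary K-means only to find centroids, give each centroid a fixed number of "slots", and then match points to slots with a minimum-cost assignment. The function name `equal_size_clusters` and the random test points are hypothetical. The sketch assumes the number of points is divisible by the number of groups, treats lng/lat as plane coordinates (a rough approximation over a small region), and ignores the `areas` column.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def equal_size_clusters(X, n_clusters, random_state=0):
    """Hypothetical helper: label points so every group has the same size."""
    n_samples = len(X)
    if n_samples % n_clusters != 0:
        raise ValueError("number of points must be divisible by n_clusters")
    size = n_samples // n_clusters

    # Step 1: ordinary K-means, used only to locate the centroids.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)

    # Step 2: repeat each centroid `size` times so that every centroid
    # offers exactly `size` slots; the cost matrix becomes square.
    slots = np.repeat(km.cluster_centers_, size, axis=0)
    cost = cdist(X, slots, metric="sqeuclidean")

    # Step 3: minimum-cost one-to-one matching of points to slots.
    row_idx, col_idx = linear_sum_assignment(cost)

    # Slot j belongs to centroid j // size.
    labels = np.empty(n_samples, dtype=int)
    labels[row_idx] = col_idx // size
    return labels


# Synthetic points in roughly the same lng/lat range as the sample data
# (made up for this example, not the real POIs).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(116.24, 116.43, 100),  # lng
    rng.uniform(39.93, 40.10, 100),    # lat
])
labels = equal_size_clusters(X, n_clusters=10)
print(np.bincount(labels))  # every group holds exactly 10 points
```

The assignment step works on an n-by-n cost matrix, so this is only practical for small to moderate numbers of points. Keeping each group within a single area would need an extra step, for example running the same procedure separately per area.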

0 answers:

No answers