Question

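I managed to cluster thousands of geographic locations from my database using DBSCAN. How can I retrieve an array containing the IDs of the places in each cluster, rather than a list of latitude/longitude pairs?

More precisely:

I have a database containing thousands of places (model ThePlace):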

id | placeLat    | placeLng
1  | -0.72840701 | 1.07480303
2  | 0.56603302  | -0.71806147
3  | -0.85542777 | 0.80393827
4  | 0.6079188   | -0.65524001
5  | -0.68533746 | 0.5591115
6  | 0.54826708  | -0.80626836
7  | 0.89279842  | -0.68575192
8  | 0.46384115  | -0.66288763
................

Etc.

Here is the code in Django:

allplaces = ThePlace.objects.all()
centers = [[place.placeLat, place.placeLng] for place in allplaces]
print(centers)

This returns:

[[69.6140162630014, 26.8535041809082], [10.791441, 79.1368305], [52.6237376, -3.83939629999998], [21.6229701, -81.5629847], [46.798924, -71.224765], [31.5051046, -5.9447371],    #etc...

Then I implemented the code suggested by the DBSCAN demo:

X, labels_true = make_blobs(n_samples=numberofplaces, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.01, min_samples=2).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("1st cluster:")
print(X[labels == 0])

This returns:

[[ 0.69845443 -0.12814653] [ 0.64770332 -0.14093706] [ 0.69437909 -0.13627011] [ 0.67780877 -0.12647872] [ 0.71573886 -0.09318022] [ 0.6779438  -0.13639582]]

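I would like to get an array with the ID of each place, rather than an array of latitudes/longitudes. Is that possible?

For example, I would like to get: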

[[ 2, 4, 5, 12] [3, 7, 11] [5, 9, 21]   .....  ]

I am using Python 2.7 / Django 1.9.

Answer 1

Here is a solution to this problem:

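from __future__ import print_function  # print() works on Python 2.7 and 3

from itertools import groupby
from operator import itemgetter

import numpy as np
from django.http import HttpResponse
from sklearn.cluster import DBSCAN

# ThePlace is the asker's model. earth_distance(a, b) is not shown in this answer;
# it is assumed to return the distance in km between two [lat, lng] pairs.


def clusterplaces(request):
    allplaces = ThePlace.objects.all()
    centers = [[place.placeLat, place.placeLng] for place in allplaces]
    ids = [place.id for place in allplaces]
    # Epsilon in km
    epsilon = 9
    arr = np.array(centers)
    # Full pairwise distance matrix, shape (n_places, n_places)
    dist_matrix = np.array([earth_distance(a, b) for a in arr for b in arr]).reshape(arr.shape[0], arr.shape[0])
    db = DBSCAN(eps=epsilon, min_samples=2, metric='precomputed').fit(dist_matrix)
    # Pair each cluster label with its place ID; sorting puts equal labels next to each other
    id_cluster = sorted(zip(db.labels_, ids))

    # Label -1 collects the noise points that belong to no cluster
    for a, group in groupby(id_cluster, key=itemgetter(0)):
        print(a, [item[1] for item in group])

    return HttpResponse('Everything went ok')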

This returns:

-1 [69, 5477, 5578, 5579, 5640, 8357, 8375, 12147, 12294, 13837, 14719, 14916, 14919, 15739, 16128, 16288, 16491, 18765, 18814]
0 [104, 3758, 3759, 3760, 3761]
1 [18705, 18706, 18707, 18709, 18710, 18711, 18712, 18713, 18714, 18715, 18716, 18717, 18718, 18719, 18720, 18721, 18722, 18723, 18725, 18726, 18727, 18729, 18730, 18731, 18732, 18733, 18734, 18735, 18736, 18737, 18738, 18739, 18740, 18741, 18742, 18744, 18745, 18746, 18747, 18748, 18749, 18750, 18752, 18753, 18754, 18755, 18757, 18758, 18759, 18760, 18761, 18762, 18763, 18764, 18876, 18877]
2 [14723, 14724, 14725, 14801, 14802, 14920, 14922, 14923, 15023]
3 [18799, 18800, 18801, 18805, 18806, 18807]
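
The answer calls earth_distance without showing it. Below is a minimal sketch of one possible version, assuming it takes two [lat, lng] pairs in degrees and returns the great-circle distance in kilometers using the haversine formula. The function body and the 6371 km mean Earth radius are assumptions for illustration, not the answerer's actual helper:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def earth_distance(a, b):
    """Hypothetical helper: haversine distance in km between two [lat, lng] pairs (degrees)."""
    lat1, lng1 = radians(a[0]), radians(a[1])
    lat2, lng2 = radians(b[0]), radians(b[1])
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    # Clamp to 1.0 so rounding error cannot push sqrt(h) outside asin's domain
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))
```

With a helper like this, epsilon = 9 in the answer would mean that places within 9 km of each other count as neighbors.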

Answer 2

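Use labels.

That array holds your labels. Each index identifies an object (its row position in the input), and the value is the cluster number.

labels == 0 selects the objects in the first cluster.

See the last line of your code, where you use it as an index into the data array X.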
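
To make the mapping concrete: labels has one entry per input row, in the same order as the rows passed to fit, so a list of IDs built in that same order can be filtered by label. A small sketch with made-up IDs and labels (both hypothetical, not taken from the question's data):

```python
# Hypothetical data: ids[i] is the database ID of row i, labels[i] is its cluster.
ids = [11, 12, 13, 14, 15, 16]
labels = [0, 1, 0, -1, 1, 0]  # DBSCAN uses -1 for noise

# Plain-Python equivalent of the NumPy mask ids[labels == 0]
first_cluster_ids = [pid for pid, label in zip(ids, labels) if label == 0]

# All clusters at once, skipping noise points
clusters = {}
for pid, label in zip(ids, labels):
    if label != -1:
        clusters.setdefault(label, []).append(pid)

id_lists = [clusters[k] for k in sorted(clusters)]
print(id_lists)  # [[11, 13, 16], [12, 15]]
```

This only works when the rows passed to fit are the places themselves. In the question, X comes from make_blobs, which generates synthetic, shuffled samples around the given centers, so its rows do not line up with database rows.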

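Do not use StandardScaler on geographic coordinates. You are distorting your world; does that make sense? I would rather use the "haversine" metric, which also lets you work with distances in meters.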
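
A sketch of what the haversine suggestion involves, assuming scikit-learn's documented convention that the "haversine" metric takes [latitude, longitude] in radians and measures distance in radians on a unit sphere. The helper names and the 6,371,000 m Earth radius are assumptions for illustration; the two sample coordinates are taken from the question's printed output:

```python
from math import radians

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters


def to_radians(coords_deg):
    """Convert [[lat, lng], ...] from degrees to radians."""
    return [[radians(lat), radians(lng)] for lat, lng in coords_deg]


def eps_in_radians(eps_m):
    """Convert a neighborhood radius in meters to an angle in radians."""
    return eps_m / EARTH_RADIUS_M


coords = to_radians([[46.798924, -71.224765], [52.6237376, -3.83939629999998]])
eps = eps_in_radians(9000.0)  # 9 km expressed in meters

# With scikit-learn installed, these values could then be used along the lines of:
#   DBSCAN(eps=eps, min_samples=2, metric='haversine', algorithm='ball_tree').fit(coords)
```

Unlike StandardScaler, this conversion keeps real-world distances intact, so eps keeps a physical meaning.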

DBSCAN: retrieving IDs instead of latitude/longitude