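I managed to cluster several thousand geographic locations from my database using DBSCAN. How can I retrieve an array containing the IDs in each cluster, rather than a list of latitude/longitude pairs?
More precisely:
I have a database with several thousand places (model ThePlace):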
id | placeLat | placeLng
1 | -0.72840701 | 1.07480303
2 | 0.56603302 | -0.71806147
3 | -0.85542777 | 0.80393827
4 | 0.6079188 | -0.65524001
5 | -0.68533746 | 0.5591115
6 | 0.54826708 | -0.80626836
7 | 0.89279842 | -0.68575192
8 | 0.46384115 | -0.66288763
... etc.
Here is the code in Django:
allplaces = ThePlace.objects.all()
centers = [[place.placeLat, place.placeLng] for place in allplaces]
print(centers)
This returns:
[[69.6140162630014, 26.8535041809082], [10.791441, 79.1368305], [52.6237376, -3.83939629999998], [21.6229701, -81.5629847], [46.798924, -71.224765], [31.5051046, -5.9447371], #etc...
Then I implemented the code suggested by the DBSCAN demo:
X, labels_true = make_blobs(n_samples=numberofplaces, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.01, min_samples=2).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("1st cluster:")
print(X[labels == 0])
This returns:
[[ 0.69845443 -0.12814653] [ 0.64770332 -0.14093706] [ 0.69437909 -0.13627011] [ 0.67780877 -0.12647872] [ 0.71573886 -0.09318022] [ 0.6779438 -0.13639582]]
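I would like to get an array with the ID of each place, rather than an array of latitude/longitude pairs. Is that possible?
For example, I would like to get:
[[ 2, 4, 5, 12] [3, 7, 11] [5, 9, 21] ..... ]
I am using Python 2.7 / Django 1.9.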
Answer 0 (score: 2)
Here is a solution to this problem:
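import numpy as np
from itertools import groupby
from operator import itemgetter
from django.http import HttpResponse
from sklearn.cluster import DBSCAN

# ThePlace (the model) and earth_distance (distance in km between two
# [lat, lng] pairs) are assumed to be defined or imported elsewhere.

def clusterplaces(request):
    allplaces = ThePlace.objects.all()
    centers = [[place.placeLat, place.placeLng] for place in allplaces]
    ids = [place.id for place in allplaces]
    # Epsilon in km
    epsilon = 9
    arr = np.array(centers)
    # Full n x n pairwise distance matrix
    dist_matrix = np.array([earth_distance(a, b) for a in arr for b in arr]).reshape(arr.shape[0], arr.shape[0])
    db = DBSCAN(eps=epsilon, min_samples=2, metric='precomputed').fit(dist_matrix)
    # Pair each label with its place ID and sort so groupby sees each label contiguously
    id_cluster = sorted(zip(db.labels_, ids))
    for label, group in groupby(id_cluster, key=itemgetter(0)):
        # Python 2 print statement; label -1 is DBSCAN noise
        print label, [item[1] for item in group]
    return HttpResponse('Everything went ok')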
This returns:
-1 [69, 5477, 5578, 5579, 5640, 8357, 8375, 12147, 12294, 13837, 14719, 14916, 14919, 15739, 16128, 16288, 16491, 18765, 18814]
0 [104, 3758, 3759, 3760, 3761]
1 [18705, 18706, 18707, 18709, 18710, 18711, 18712, 18713, 18714, 18715, 18716, 18717, 18718, 18719, 18720, 18721, 18722, 18723, 18725, 18726, 18727, 18729, 18730, 18731, 18732, 18733, 18734, 18735, 18736, 18737, 18738, 18739, 18740, 18741, 18742, 18744, 18745, 18746, 18747, 18748, 18749, 18750, 18752, 18753, 18754, 18755, 18757, 18758, 18759, 18760, 18761, 18762, 18763, 18764, 18876, 18877]
2 [14723, 14724, 14725, 14801, 14802, 14920, 14922, 14923, 15023]
3 [18799, 18800, 18801, 18805, 18806, 18807]
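The answer calls earth_distance without showing its definition. A minimal sketch of what such a helper might look like follows, assuming it takes two [lat, lng] pairs in degrees and returns the great-circle distance in km via the haversine formula. The radius constant and the exact signature are assumptions, not part of the original answer.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def earth_distance(a, b):
    """Great-circle distance in km between two [lat, lng] pairs in degrees."""
    lat1, lng1 = math.radians(a[0]), math.radians(a[1])
    lat2, lng2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine formula
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))
```

With a helper like this, epsilon = 9 in the view above means "neighbors within 9 km". Note that building the full distance matrix in Python takes n squared calls, which can be slow for thousands of places.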
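Answer 1 (score: 1)
Use labels.
That array holds your labels. Each index is the position of an object in the input data, and the value is its cluster number (-1 means noise).
labels == 0 selects the objects in the first cluster.
See your last line, where you use it as an index into the data array X. Indexing an array of IDs, built in the same order as X, with the same mask gives you the IDs instead of the coordinates.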
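Because labels has one entry per input row, in the order the rows were passed to fit, pairing it with a list of IDs built in that same order yields the IDs per cluster. A minimal sketch follows; the function name and the sample values are illustrative only and are not taken from the question's data. With NumPy arrays, ids[labels == 0] would give the same result for a single cluster.

```python
from collections import defaultdict


def group_ids_by_label(labels, ids):
    """Return {cluster_label: [ids...]}; label -1 is DBSCAN noise."""
    clusters = defaultdict(list)
    for label, place_id in zip(labels, ids):
        clusters[int(label)].append(place_id)
    return dict(clusters)


# Illustrative values only: in practice labels would be db.labels_
# and ids would be [place.id for place in allplaces].
labels = [0, 1, 0, -1, 1]
ids = [2, 3, 4, 5, 7]
clusters = group_ids_by_label(labels, ids)

# Drop noise and list the clusters in label order.
id_lists = [clusters[k] for k in sorted(clusters) if k != -1]
```

Here id_lists has the nested-list shape the question asks for, one inner list of IDs per cluster.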
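Do not use StandardScaler on geographic coordinates. You are rescaling your world; does that make sense? I would rather use the "haversine" metric, which also lets you work with distances in meters.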
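A rough sketch of the haversine approach with scikit-learn follows. The haversine metric there expects [lat, lng] in radians and measures distance as an angle, so a radius in km has to be divided by the Earth's radius. The helper names, the radius constant, and the default eps_km=9 are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def to_radians(points):
    """Convert [[lat, lng], ...] from degrees to radians."""
    return [[math.radians(lat), math.radians(lng)] for lat, lng in points]


def km_to_radians(km):
    """Express a great-circle distance in km as an angle in radians."""
    return km / EARTH_RADIUS_KM


def cluster_labels(points_deg, eps_km=9, min_samples=2):
    """Run DBSCAN with the haversine metric; returns one label per point."""
    # Imported here so the helpers above work without scikit-learn installed.
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=km_to_radians(eps_km), min_samples=min_samples,
                metric='haversine', algorithm='ball_tree')
    return db.fit(to_radians(points_deg)).labels_
```

Calling cluster_labels(centers) on the unscaled coordinates avoids both StandardScaler and a precomputed n by n distance matrix. For a radius in meters, divide by 1000 before passing it as eps_km.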