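From reading around, I found that a precomputed distance matrix can be passed to scikit-learn's DBSCAN. Unfortunately, I don't know how to pass it in for the computation.
Say I have a 1D array of 100 elements containing only the names of the nodes. Then I have a 100x100 2D matrix holding the distance between each pair of elements (in the same order).
I know I have to call it like this: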
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
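for a maximum distance of 2 between nodes and a minimum of 5 nodes per cluster. I also use "precomputed" to indicate that the 2D matrix should be used. But how do I pass the information in for the computation?
The same question probably applies when using the RAPIDS cuML DBSCAN function (GPU accelerated).
Answer 0: (score: 1)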
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None) [...]
[...] metric : string or callable, default='euclidean'. The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors for DBSCAN. [...]
So the usual way to call it is:
from sklearn.cluster import DBSCAN
clustering = DBSCAN()
clustering.fit(X)
If you have a distance matrix, you can do:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
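To connect this to the setup in the question (a 1D array of names plus a square matrix in the same order), here is a minimal sketch. The names, the 1D positions used to build the matrix, and the `eps` and `min_samples` values are all made up for illustration. The point is only that the matrix goes in as `X`, and that `labels_[i]` refers to row `i`, so the names array can be zipped against the labels afterwards.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical node names; row/column i of the matrix belongs to names[i].
names = np.array(["a", "b", "c", "d", "e", "f"])

# Hypothetical 1D positions, used only to build a symmetric distance matrix.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 50.0])
distance_matrix = np.abs(positions[:, None] - positions[None, :])

# With metric="precomputed", X is the square distance matrix itself.
db = DBSCAN(eps=1.5, min_samples=2, metric="precomputed")
labels = db.fit_predict(distance_matrix)

# Map each label back to the node names (label -1 means noise).
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(str(name))

print(clusters)
```

The names never enter DBSCAN; they are only used after fitting to interpret the labels.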
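Answer 1: (score: 0)
OK, I have tried your suggestion, but it did not work. I am following this post:
This is the current state of my code. When the distances are computed as part of the DBSCAN call, it works like a champ. When I try to do the same with the precomputed matrix, I get an error.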
#!/usr/bin/env python3
import pandas as pd, numpy as np
from numpy import load
from sklearn.neighbors import DistanceMetric
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from numpy import save
kms_per_radian = 6371.0088
#Loading GPS coordinates array
cities_df = pd.read_csv('geo.csv',delimiter=';',header=0)
reducida = cities_df.copy()
coords = reducida.drop(columns=['ID'])
#Transforming to radians
cities_df['lat'] = np.radians(cities_df['lat'])
cities_df['lon'] = np.radians(cities_df['lon'])
#Function to compute distance. This is simulated, in reality we will use a different procedure
dist = DistanceMetric.get_metric('haversine')
cities_df[['lat','lon']].to_numpy()
matrix = pd.DataFrame(dist.pairwise(cities_df[['lat','lon']].to_numpy())*kms_per_radian, columns=cities_df.ID.unique(), index=cities_df.ID.unique())
# Saving distance matrix to disk
# save('data.npy',matrix)
epsilon = 1 / kms_per_radian
# Load matrix
#matrix = load('data.npy')
# Computing clusters with precomputed matrix, DOESN'T WORK
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='precomputed')
db.fit(matrix)
# Instead if I use the embedded haversine function IT DOES WORK. Should have the same result
# db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))
This is the error I get:
File "./DBSCAN.py", line 41, in <module>
db.fit(matrix)
File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/cluster/_dbscan.py", line 330, in fit
neighbors_model = NearestNeighbors(
File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_unsupervised.py", line 113, in __init__
super().__init__(
File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 305, in __init__
self._check_algorithm_metric()
File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 332, in _check_algorithm_metric
raise ValueError("Metric '%s' not valid. Use "
ValueError: Metric 'precomputed' not valid. Use sorted(sklearn.neighbors.VALID_METRICS['ball_tree']) to get valid options. Metric can also be a callable function.
This is my distance matrix:
            791        794        798       1124       1125
791    0.000000   6.091447  35.342980  42.952046  29.158508
794    6.091447   0.000000  29.394745  39.452365  23.151700
798   35.342980  29.394745   0.000000  41.346497  12.675131
1124  42.952046  39.452365  41.346497   0.000000  29.392357
1125  29.158508  23.151700  12.675131  29.392357   0.000000
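This is only a subset of the data; in reality I have a much larger matrix. The numbers are the IDs of the columns and rows (maybe that is where the error is).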
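The ValueError in the traceback names the `ball_tree` algorithm: a tree cannot be built from a precomputed matrix, so `metric='precomputed'` is not among its valid metrics. The following is a minimal sketch rather than a confirmed fix for the full script. It uses only the five-row subset shown above, leaves `algorithm` at its default, and assumes a hypothetical 10 km threshold. Note that `eps` has to be in the same unit as the matrix (kilometres here), whereas `1 / kms_per_radian` is an angle in radians and only suits the haversine variant.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Row/column IDs and distances (km) copied from the subset shown above.
ids = [791, 794, 798, 1124, 1125]
matrix = np.array([
    [0.000000, 6.091447, 35.342980, 42.952046, 29.158508],
    [6.091447, 0.000000, 29.394745, 39.452365, 23.151700],
    [35.342980, 29.394745, 0.000000, 41.346497, 12.675131],
    [42.952046, 39.452365, 41.346497, 0.000000, 29.392357],
    [29.158508, 23.151700, 12.675131, 29.392357, 0.000000],
])

# Assumption: a 10 km neighbourhood radius, chosen only for illustration.
eps_km = 10.0

# No algorithm='ball_tree' here: the default is compatible with 'precomputed'.
db = DBSCAN(eps=eps_km, min_samples=1, metric="precomputed")
labels = db.fit_predict(matrix)

# Group the row IDs by cluster label.
clusters = {}
for node_id, label in zip(ids, labels):
    clusters.setdefault(int(label), []).append(node_id)

print("Number of clusters: {}".format(len(clusters)))
print(clusters)
```

With this threshold only 791 and 794 (about 6.1 km apart) end up in the same cluster. Passing a pandas DataFrame should also be accepted, since scikit-learn converts it to an array, but the IDs in its index and columns are not used by DBSCAN.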
Answer 2: (score: 0)
This is what I get when I run:
pip install -U scikit-learn
Requirement already up-to-date: scikit-learn in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (0.23.1)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (0.15.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (1.5.0)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (1.19.0)