Question

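Reading around, I found that it is possible to pass a precomputed distance matrix to scikit-learn's DBSCAN. Unfortunately, I don't know how to pass it in for the calculation.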

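Say I have a 1D array with 100 elements that holds just the names of the nodes. Then I have a 100x100 2D matrix with the distance between each pair of elements (in the same order).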

I know I have to call it like this:

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

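for a distance of 2 between nodes and a minimum of 5 nodes per cluster, with "precomputed" indicating that the 2D matrix should be used. But how do I pass the data in for the calculation?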

The same question may apply when using the RAPIDS cuML DBSCAN function (GPU-accelerated).

Answer 1

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', 
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]

[...]
metric : string, or callable, default='euclidean'
The metric to use when calculating distance between instances in a feature array. If 
metric is a string or callable, it must be one of the options allowed by 
sklearn.metrics.pairwise_distances for its metric parameter. If metric is 
"precomputed", X is assumed to be a distance matrix and must be square. X may be a 
sparse graph, in which case only "nonzero" elements may be considered neighbors for 
DBSCAN.
[...]
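As a hedged illustration of the sparse-graph case mentioned in the quoted documentation, the sketch below builds a sparse distance graph with NearestNeighbors.radius_neighbors_graph and feeds it to DBSCAN. The random points, the eps value and the slightly larger graph radius are assumptions made up for the example; they are not part of the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

# Made-up data for illustration: 200 random points in a 10x10 square.
rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(200, 2))

eps = 0.8  # assumed neighborhood radius, in the same units as the distances

# Dense precomputed matrix: every pairwise distance is stored.
dense = pairwise_distances(points)

# Sparse precomputed graph: only distances up to a radius slightly larger
# than eps are stored; pairs that are farther apart are simply absent.
nn = NearestNeighbors(radius=eps * 1.1).fit(points)
sparse = nn.radius_neighbors_graph(mode="distance")

labels_dense = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(dense).labels_
labels_sparse = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(sparse).labels_
```

The sparse form is mainly of interest when the full square matrix would not fit in memory, since only pairs within the chosen radius need to be stored.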

So, the usual way to call it is:

from sklearn.cluster import DBSCAN

clustering = DBSCAN()
clustering.fit(X)

If you have a distance matrix, you can do:

from sklearn.cluster import DBSCAN

clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
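To connect this back to the setup in the question (a 1D array of 100 names plus a 100x100 matrix in the same order), here is a minimal sketch. The node names and the random points used to build the matrix are hypothetical stand-ins; in practice the matrix would come from whatever procedure produces the distances.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Hypothetical data: 100 node names and 100 random 2D points, used only
# to build a 100x100 distance matrix in the same order as the names.
rng = np.random.default_rng(0)
names = np.array([f"node_{i}" for i in range(100)])
points = rng.uniform(0, 20, size=(100, 2))
distance_matrix = pairwise_distances(points)  # shape (100, 100), symmetric

# The matrix itself is what goes into fit(); eps is in the matrix's units.
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
labels = db.fit_predict(distance_matrix)

# labels[i] belongs to names[i]; a label of -1 marks a noise point.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(str(name))
```

The names never enter DBSCAN at all. They are only used afterwards to read the labels, which come back in the same order as the rows of the matrix.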

Answer 2

OK, I tried what you suggested, but it didn't work. I'm following this post:

https://kanoki.org/2019/12/27/how-to-calculate-distance-in-python-and-pandas-using-scipy-spatial-and-distance-functions/

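This is the current state of my code. When the distance is computed as part of the DBSCAN call, it works like a champ. When I try to do the same with a precomputed matrix, I get an error.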

#!/usr/bin/env python3

import pandas as pd, numpy as np
from numpy import load
from sklearn.neighbors import DistanceMetric
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from numpy import save

kms_per_radian = 6371.0088

#Loading GPS coordinates array
cities_df = pd.read_csv('geo.csv',delimiter=';',header=0)

reducida = cities_df.copy()
coords = reducida.drop(columns=['ID'])

#Transforming to radians
cities_df['lat'] = np.radians(cities_df['lat'])
cities_df['lon'] = np.radians(cities_df['lon'])

#Function to compute distance. This is simulated, in reality we will use a different procedure
dist = DistanceMetric.get_metric('haversine')

cities_df[['lat','lon']].to_numpy()

matrix = pd.DataFrame(dist.pairwise(cities_df[['lat','lon']].to_numpy())*kms_per_radian,  columns=cities_df.ID.unique(), index=cities_df.ID.unique())

# Saving distance matrix to disk
# save('data.npy',matrix)


epsilon = 1 / kms_per_radian

# Load matrix
#matrix = load('data.npy')

# Computing clusters with precomputed matrix, DOESN'T WORK
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='precomputed')
db.fit(matrix)

# Instead if I use the embedded haversine function IT DOES WORK. Should have the same result
# db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))


cluster_labels = db.labels_

num_clusters = len(set(cluster_labels))

clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])

print('Number of clusters: {}'.format(num_clusters))

This is the error I get:

  File "./DBSCAN.py", line 41, in <module>
    db.fit(matrix)
  File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/cluster/_dbscan.py", line 330, in fit
    neighbors_model = NearestNeighbors(
  File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_unsupervised.py", line 113, in __init__
    super().__init__(
  File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 305, in __init__
    self._check_algorithm_metric()
  File "/Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 332, in _check_algorithm_metric
    raise ValueError("Metric '%s' not valid. Use "
ValueError: Metric 'precomputed' not valid. Use sorted(sklearn.neighbors.VALID_METRICS['ball_tree']) to get valid options. Metric can also be a callable function.

This is my distance matrix:

           791        794        798        1124       1125
791    0.000000   6.091447  35.342980  42.952046  29.158508
794    6.091447   0.000000  29.394745  39.452365  23.151700
798   35.342980  29.394745   0.000000  41.346497  12.675131
1124  42.952046  39.452365  41.346497   0.000000  29.392357
1125  29.158508  23.151700  12.675131  29.392357   0.000000

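This is just a subset of the data; I actually have a much bigger matrix. The numbers are the column and row IDs (maybe that is where the error is).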
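Reading the ValueError in the traceback, the complaint is that 'precomputed' is not a valid metric for algorithm='ball_tree'; the row and column IDs do not appear to be the cause. A minimal sketch of one way around it, using the five-city subset shown in this post: leave algorithm at its default so the brute-force neighbor search is used, and express eps in the same units as the matrix. The value of 10 km is an assumption picked for illustration. Note that 1 / kms_per_radian is an angle in radians, which does not match a matrix in kilometres.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# The five-city subset of the distance matrix from the post, in kilometres.
ids = [791, 794, 798, 1124, 1125]
matrix = pd.DataFrame(
    [
        [0.000000, 6.091447, 35.342980, 42.952046, 29.158508],
        [6.091447, 0.000000, 29.394745, 39.452365, 23.151700],
        [35.342980, 29.394745, 0.000000, 41.346497, 12.675131],
        [42.952046, 39.452365, 41.346497, 0.000000, 29.392357],
        [29.158508, 23.151700, 12.675131, 29.392357, 0.000000],
    ],
    index=ids,
    columns=ids,
)

# Assumed threshold: eps must be in the matrix's units (km here).
eps_km = 10.0

# No algorithm='ball_tree': tree-based searches need raw coordinates,
# while a precomputed matrix is handled by the default setting.
db = DBSCAN(eps=eps_km, min_samples=1, metric="precomputed")
labels = db.fit(matrix.to_numpy()).labels_

# Attach the labels to the original IDs.
result = pd.Series(labels, index=matrix.index, name="cluster")
print(result)
```

With this threshold, cities 791 and 794 (about 6.1 km apart) share a cluster, and the other three each end up in a cluster of their own because min_samples=1 makes every point a core point.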

Answer 3

This is what I get when I run:

pip install -U scikit-learn
Requirement already up-to-date: scikit-learn in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (0.23.1)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (0.15.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (1.5.0)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/jnebrera/.pyenv/versions/3.8.3/lib/python3.8/site-packages (from scikit-learn) (1.19.0)

Precomputed distance matrix in DBSCAN
