Clustering with DBSCAN is surprisingly slow

Date: 2016-07-26 14:56:31

Tags: python scikit-learn

I am experimenting with clustering and was surprised at how slow it seems to be. I generated a random planted-partition graph with 30 communities of 30 nodes each; nodes in the same community have a 90% chance of being connected, and nodes in different communities have a 10% chance. I measure the similarity between two nodes as the Jaccard similarity of their neighbor sets.
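
To illustrate the distance I am using (one minus the Jaccard similarity), here is a tiny worked example on two made-up neighbor sets:

a = {1, 2, 3}
b = {2, 3, 4}
print 1 - len(a & b) / float(len(a | b))   # shared {2,3}, union {1,2,3,4} -> 1 - 2/4 = 0.5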

This toy example spends about 15 seconds on the dbscan call alone, and that time grows very quickly as I add nodes. With only 900 nodes in total, this seems very slow.
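
My understanding (which may be off) is that with a Python callable as the metric, dbscan ends up evaluating the metric from Python for roughly every pair of points, so the cost grows quadratically with the node count. A back-of-the-envelope count for this example:

n = 30 * 30               # 900 nodes in total
print n * (n - 1) // 2    # ~404,550 pairwise metric evaluations, each a separate Python call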

from __future__ import division
import numpy as np
from sklearn.cluster import dbscan
import networkx as nx
import matplotlib.pyplot as plt
import time

#Define the Jaccard distance. Following the example for clustering with Levenshtein distance from http://scikit-learn.org/stable/faq.html
def jaccard_distance(x,y):
    return 1 - len(neighbors[x].intersection(neighbors[y]))/len(neighbors[x].union(neighbors[y]))

def jaccard_metric(x,y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return jaccard_distance(i, j)

#Simulate a planted partition graph. The simplest form of community detection benchmark.
num_communities = 30
size_of_communities = 30
print "planted partition"
G = nx.planted_partition_graph(num_communities, size_of_communities, 0.9, 0.1,seed=42)

#Make a hash table of sets of neighbors for each node.
neighbors={}
for n in G:
    for nbr in G[n]:
        if not (n in neighbors):
            neighbors[n] = set()
        neighbors[n].add(nbr)

print "Made data"

X = np.arange(len(G)).reshape(-1, 1)  # the "data" is just node indices; the metric looks up neighbor sets by index

t = time.time()
db = dbscan(X, metric=jaccard_metric, eps=0.85, min_samples=2)
print db

print "Clustering took ", time.time()-t, "seconds"
  

How can I make this scale to a larger number of nodes?

1 Answer:

Answer (score: 5)

Here is a solution that speeds up the DBSCAN call by roughly 1890x on my machine:

# the following code should be added to the question's code (it uses G and db)

import igraph

# use igraph to calculate Jaccard distances quickly
# to_edgelist yields (u, v, data) triples; transpose them and keep only the endpoints
edges = zip(*nx.to_edgelist(G))
G1 = igraph.Graph(len(G), zip(*edges[:2]))
D = 1 - np.array(G1.similarity_jaccard(loops=False))

# DBSCAN is much faster with metric='precomputed'
t = time.time()
db1 = dbscan(D, metric='precomputed', eps=0.85, min_samples=2)
print "clustering took %.5f seconds" %(time.time()-t)

assert np.array_equal(db, db1)

And here is the output:

...
Clustering took  8.41049790382 seconds
clustering took 0.00445 seconds
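
For completeness, a similar precomputed distance matrix can be built without igraph, straight from the adjacency matrix. This is only a sketch; it assumes a networkx version where adjacency_matrix returns a SciPy sparse matrix, and a graph with no self-loops and no isolated nodes (so no union of neighbor sets is empty):

# Vectorized Jaccard distances from the adjacency matrix A:
# shared neighbors = A.dot(A), union size = deg_i + deg_j - shared
A = nx.adjacency_matrix(G).toarray().astype(float)
shared = A.dot(A)
deg = A.sum(axis=1)
union = deg[:, None] + deg[None, :] - shared
D2 = 1 - shared / union
db2 = dbscan(D2, metric='precomputed', eps=0.85, min_samples=2)

Either way, the speedup comes from replacing hundreds of thousands of Python-level metric calls with a distance matrix computed in compiled code, after which metric='precomputed' lets DBSCAN skip the distance computation entirely.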