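I am programming divisive (top-down) clustering from scratch. In divisive clustering, we start at the top with all examples (variables) in a single cluster. The cluster is then split recursively until each example is in its own singleton cluster.
I use the Pearson correlation coefficient as the measure for divisive clustering. Pasted below is my initial attempt. It reads the data and computes the matrix of pairwise correlation coefficients.
Now the first cluster needs to be split according to the minimum value of the correlation coefficient. Any idea how to do that? Any pointers and suggestions are welcome.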
import pandas as pd
from math import sqrt
# Read data from GitHub
df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]
# Define correlation coefficient
def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)
    # Sums of the squares
    sum1Sq = sum([pow(v, 2) for v in v1])
    sum2Sq = sum([pow(v, 2) for v in v2])
    # Sum of the products
    pSum = sum([v1[i] * v2[i] for i in range(len(v1))])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
    # A constant row has zero variance, so r is undefined; treat it as 0
    if den == 0:
        return 0
    return num / den
# Dict of pairwise scores (Pearson r is a similarity: lower means less alike)
dist = {}
# Start from the self-correlation of the first row, which is the maximum r
min_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        # Compute correlation for each pair
        dist_curr = pearson(data[i], data[j])
        # Store correlation in dict
        dist[(i, j)] = dist_curr
        # Store the minimum correlation seen so far
        if dist_curr < min_dist:
            min_dist = dist_curr
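
One possible way to do the split, as a rough sketch rather than a definitive answer: take the pair of rows with the smallest correlation as two seeds, then put every other row in the group of whichever seed it correlates with more strongly, and recurse on each group. The seed-based assignment is an assumption on my part (other splitting rules exist), and the names `split_cluster`, `divisive` and `toy_rows` are made up for illustration. The toy data stands in for the blog data so the snippet runs without a download.

```python
from math import sqrt


def pearson(v1, v2):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(v * v for v in v1)
    sum2_sq = sum(v * v for v in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - sum1 * sum2 / n
    den_sq = (sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n)
    if den_sq <= 0:
        return 0.0
    return num / sqrt(den_sq)


def split_cluster(rows, members):
    """Split a list of row indices (length >= 2) into two groups.

    The two least correlated rows act as seeds; every other row joins
    the seed it correlates with more strongly (ties go to the first seed).
    """
    pairs = ((i, j) for k, i in enumerate(members) for j in members[k + 1:])
    seed_a, seed_b = min(pairs, key=lambda p: pearson(rows[p[0]], rows[p[1]]))
    left, right = [seed_a], [seed_b]
    for m in members:
        if m in (seed_a, seed_b):
            continue
        if pearson(rows[m], rows[seed_a]) >= pearson(rows[m], rows[seed_b]):
            left.append(m)
        else:
            right.append(m)
    return left, right


def divisive(rows, members=None):
    """Recursively split until singletons; returns a nested tuple of row indices."""
    if members is None:
        members = list(range(len(rows)))
    if len(members) == 1:
        return members[0]
    left, right = split_cluster(rows, members)
    return (divisive(rows, left), divisive(rows, right))


# Hypothetical toy data: rows 0 and 1 rise together, rows 2 and 3 fall together
toy_rows = [
    [1, 2, 3, 4],
    [2, 4, 6, 9],
    [4, 3, 2, 1],
    [9, 6, 4, 2],
]
print(divisive(toy_rows))  # prints ((0, 1), (2, 3))
```

Each split always puts at least one seed on each side, so the groups shrink at every level and the recursion stops at singletons. With the original script, `divisive(data)` would be the equivalent call.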