Divisive clustering from scratch

Time: 2015-08-18 12:02:15

Tags: python cluster-analysis hierarchical-clustering

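I am programming divisive (top-down) clustering from scratch. In divisive clustering we start at the top, with all examples (variables) in a single cluster. That cluster is then split recursively until each example is in its own singleton cluster.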
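The recursion described above can be sketched independently of the splitting rule. This is only a minimal illustration: `divisive` and `split_in_half` are hypothetical names, and `split_in_half` is a placeholder rule that stands in for whatever criterion is actually used to divide a cluster.

```python
def divisive(members, split):
    """Recursively split `members` until every cluster is a singleton.

    `split` is any function that takes a list of two or more items and
    returns two non-empty lists. The result is a nested tuple (the tree).
    """
    if len(members) == 1:
        return members[0]
    left, right = split(members)
    return (divisive(left, split), divisive(right, split))


def split_in_half(members):
    # Placeholder rule for illustration only: cut the list in the middle.
    mid = len(members) // 2
    return members[:mid], members[mid:]


tree = divisive(["a", "b", "c", "d"], split_in_half)
print(tree)  # (('a', 'b'), ('c', 'd'))
```

Because every split returns two non-empty parts, each recursive call works on a strictly smaller cluster, so the recursion always ends at singletons.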

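I use the Pearson correlation coefficient as the measure for divisive clustering. Pasted below is my initial attempt: I read the data and compute the matrix of correlation coefficients.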

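Now we need to split the first cluster based on the minimum value of the correlation coefficient. Any idea how to do that? Any pointers and suggestions are welcome.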

import pandas as pd
from math import sqrt

# Read data from GitHub
df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
# Convert rows to plain lists and keep a small subset (rows 1 to 9)
data = df.values.tolist()
data = data[1:10]

# Define correlation coefficient
def pearson(v1, v2):
  # Simple sums
  sum1 = sum(v1)
  sum2 = sum(v2)
  # Sums of the squares
  sum1Sq = sum([pow(v, 2) for v in v1])
  sum2Sq = sum([pow(v, 2) for v in v2]) 
  # Sum of the products
  pSum = sum([v1[i] * v2[i] for i in range(len(v1))])
  # Calculate r (Pearson score)
  num = pSum - (sum1 * sum2 / len(v1))
  den = sqrt((sum1Sq - pow(sum1,2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
  if den == 0: return 0
  return num / den


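# Dict of pairwise Pearson correlations (a similarity, used here in place of a distance)
dist = {}
# Initialize with the correlation of the first row with itself (1.0 unless the row is constant)
min_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
  for j in range(i + 1, len(data)):
    # Compute correlation for each pair
    dist_curr = pearson(data[i], data[j])
    # Store correlation in dict
    dist[(i, j)] = dist_curr
    # Track the smallest correlation (the least similar pair)
    if dist_curr < min_dist:
      min_dist = dist_curr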
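
One possible way to perform the first split, offered as a sketch rather than the established method: take the pair of rows with the smallest correlation as two seeds, then assign every other row to the seed it correlates with more strongly. The function `split_cluster` and the `toy` data are hypothetical and introduced only for illustration; `pearson` is rewritten here in a mean-centered form so the block runs on its own.

```python
from itertools import combinations
from math import sqrt


def pearson(v1, v2):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(v1)
    mean1, mean2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    den = sqrt(var1 * var2)
    return cov / den if den else 0.0


def split_cluster(rows, members):
    """Split `members` (row indices, at least two) around the least correlated pair."""
    # Seeds: the pair with the smallest correlation, i.e. the least similar rows.
    seed_a, seed_b = min(
        combinations(members, 2),
        key=lambda pair: pearson(rows[pair[0]], rows[pair[1]]),
    )
    left, right = [seed_a], [seed_b]
    for m in members:
        if m in (seed_a, seed_b):
            continue
        # Assign each remaining row to the seed it correlates with more strongly.
        if pearson(rows[m], rows[seed_a]) >= pearson(rows[m], rows[seed_b]):
            left.append(m)
        else:
            right.append(m)
    return left, right


# Hypothetical toy data: two increasing rows and two decreasing rows.
toy = [
    [1, 2, 3, 4],
    [2, 4, 6, 9],
    [4, 3, 2, 1],
    [9, 6, 4, 2],
]
first_split = split_cluster(toy, [0, 1, 2, 3])
print(first_split)  # ([0, 1], [2, 3])
```

Since each seed lands in a different half, both halves are non-empty, so applying the same function again to each half would eventually reach singleton clusters. Other splitting rules (for example, running 2-means inside each cluster) are equally valid choices.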

0 answers:

No answers