Why does the distance axis scale of a SciPy dendrogram vary with the number of variables?

Asked: 2018-07-04 14:11:05

Tags: python scipy correlation hierarchical-clustering dendrogram

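I have a word distribution analysis algorithm. It generates an observation vector for each target word and, from that table, computes distances with stats.spearmanr() (rescaled from [-1, 1] to [0, 1]), producing a distance matrix (Y). I then obtain the clusters (Z) with hierarchy.average(). Finally, the dendrogram is generated and plotted.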

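My problem is that the dendrogram's scale varies with the number of target words. I assumed its distance axis would span the [0, 1] range obtained (and rescaled) from spearmanr(), as described above. But for 50 words it is [0, 0.5], for 150 words it is [0, 1], and for 1000 words it is [0, 2].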

Why does this happen (the distance scale showing values greater than those in Y)?

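I would welcome any ideas on this, since I can't seem to find any hint in the documentation or on the web (which makes me worry that I'm asking the wrong question...). Also, I need either a fixed scale or at least a way to know which scale the dendrogram is using, so that I can specify cut levels. Thanks in advance for any help.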
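On the cut-level part of the question: the heights a dendrogram draws are the merge distances stored in the third column of the linkage matrix, so the range of the distance axis can be read from Z rather than assumed. A minimal sketch of that idea, using a made-up condensed distance vector (`toy_distances` and `cut_level` are illustrative names, not part of the code in this question):

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Hypothetical condensed distances for 4 items, in pdist order:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
toy_distances = np.array([0.1, 0.8, 0.9, 0.7, 0.6, 0.2])

Z = hac.average(toy_distances)

# Column 2 of Z holds the merge heights plotted on the distance axis
heights = Z[:, 2]
print('tallest merge drawn by the dendrogram:', heights.max())

# Choose the cut relative to the tallest merge instead of a hard-coded level
cut_level = 0.5 * heights.max()
labels = hac.fcluster(Z, t=cut_level, criterion='distance')
print('cluster labels:', labels)
```

Expressing the cut as a fraction of `heights.max()` keeps the level meaningful even when the absolute scale changes between runs.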

Simplified code:

# coding: utf-8

# Statistics and visualization
import numpy as np
import scipy, random
import scipy.stats

# Clustering and dendrogram visualization
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


def remap(x, in_min, in_max, out_min, out_max):
  return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min

random.seed('7622')

sizes = [50, 250, 500, 1000, 2000]

for n in sizes:
  # Generate observation matrix
  X = []
  for i in range(n):
    vet = []
    for j in range(300):
      # Generate random observations
      vet.append(random.randint(0, 50))

    X.append(vet)

  # X is a matrix where rows are variables (target words) and columns are observations (contexts of occurrence)
  Y = scipy.stats.spearmanr(X, axis=1)

  # Y rescaling
  for i in range(len(Y[0])):
    Y[0][i] = [ remap(v, -1, 1, 0, 1) for v in Y[0][i] ]

  # Range of the rescaled correlation matrix
  print('Y [', np.min(Y[0]), ',', np.max(Y[0]), ']')

  # Clustering
  Z = hac.average(Y[0])

  # Range of the merge heights (third column of the linkage matrix)
  print('n=', n,
        'Z [', min(el[2] for el in Z), ',', max(el[2] for el in Z), ']')

[UPDATE] Results of the code above:

Y [ 0.401120498124 , 1.0 ]
n= 50 Z [ 0.634408300876 , 0.77633631869 ]
Y [ 0.379375733574 , 1.0 ]
n= 250 Z [ 0.775241869849 , 0.969704246048 ]
Y [ 0.37559031365 , 1.0 ]
n= 500 Z [ 0.935671154717 , 1.16505319575 ]
Y [ 0.370600337649 , 1.0 ]
n= 1000 Z [ 1.19646327361 , 1.47897594053 ]
Y [ 0.359010408057 , 1.0 ]
n= 2000 Z [ 1.56890165007 , 1.96898566034 ]
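
One possible reading of these numbers, offered as a sketch rather than a confirmed diagnosis: hac.average() is documented as taking a condensed distance matrix, and a two-dimensional array is instead treated as a set of observation vectors whose pairwise Euclidean distances are computed first. If so, passing the square matrix Y[0] clusters the rows of Y as points in n-dimensional space, which would explain merge heights that grow with n and exceed the values in Y. The example below compares the two call forms on random data. `squareform` comes from `scipy.spatial.distance`; `max_heights` is a hypothetical helper, and the `(1 - rho) / 2` distance (so that identical items get distance 0) is an assumption that differs from the remap() rescaling used in the question.

```python
import numpy as np
import scipy.stats
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import squareform

rng = np.random.RandomState(7622)

def max_heights(n, n_obs=300):
    """Return the tallest merge for square input and for condensed input."""
    # Random observations: n variables (rows), n_obs observations (columns)
    X = rng.randint(0, 51, size=(n, n_obs))
    rho = scipy.stats.spearmanr(X, axis=1)[0]

    # Assumed distance: 0 for rho = 1, 1 for rho = -1
    D = (1.0 - rho) / 2.0
    D = (D + D.T) / 2.0          # remove floating-point asymmetry
    np.fill_diagonal(D, 0.0)     # exact zeros on the diagonal

    # Square input: rows are treated as observation vectors
    # (recent SciPy versions may emit a ClusterWarning here)
    as_observations = hac.average(D)[:, 2].max()

    # Condensed input: values are used directly as distances
    as_distances = hac.average(squareform(D))[:, 2].max()
    return as_observations, as_distances

results = {}
for n in (50, 250, 500):
    results[n] = max_heights(n)
    print('n=', n,
          'square input:', results[n][0],
          'condensed input:', results[n][1])
```

With condensed input, average linkage can only produce heights that are means of the supplied distances, so they stay inside the range of D regardless of n.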

0 answers:

No answers