Question

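I have an algorithm for analysing word distributions. It builds an observation vector for each target word and, from that table, computes distances with stats.spearmanr() (rescaled from [-1, 1] to [0, 1]), producing a distance matrix (Y). I then obtain the clustering (Z) with hierarchy.average(). Finally, I generate and plot a dendrogram.

My problem: the scale of the dendrogram changes with the number of target words. I assumed its distance axis would span [0, 1], the range obtained (and rescaled) from spearmanr() as described above. But for 50 words it is [0, 0.5], for 150 words it is [0, 1], and for 1000 words it is [0, 2].

Why does this happen (the distance scale takes values larger than those in Y)?

I would welcome any ideas on this, since I cannot find a hint in the documentation or on the web (which makes me worry that I am asking the wrong question...). I also need a fixed scale, or at least a way to find out which scale the dendrogram is using, so that I can specify cut levels. Thanks in advance for any help.

Simplified code: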

# coding: utf-8

# Statistics and visualization
import numpy as np
import scipy, random
import scipy.stats

# Clustering and dendrogram visualization
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


def remap(x, in_min, in_max, out_min, out_max):
  return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min

random.seed('7622')

sizes = [50, 250, 500, 1000, 2000]

for n in sizes:
  # Generate observation matrix
  X = []
  for i in range(n):
    vet = []
    for j in range(300):
      # Generate random observations
      vet.append(random.randint(0, 50))

    X.append(vet)

  # X is a matrix whose rows are variables (target words) and whose columns are observations (contexts of occurrence)
  Y = scipy.stats.spearmanr(X, axis=1)

  # Y rescaling
  for i in range(len(Y[0])):
    Y[0][i] = [ remap(v, -1, 1, 0, 1) for v in Y[0][i] ]

  print 'Y [', np.matrix(Y[0]).min(), ',', np.matrix(Y[0]).max(), ']'

  # Clustering
  Z = hac.average(Y[0])

  print 'n=', n, \
        'Z [', min([ el[2] for el in Z ]), ',', max([ el[2] for el in Z ]), ']'

[UPDATE] Output of the code above:

Y [ 0.401120498124 , 1.0 ]
n= 50 Z [ 0.634408300876 , 0.77633631869 ]
Y [ 0.379375733574 , 1.0 ]
n= 250 Z [ 0.775241869849 , 0.969704246048 ]
Y [ 0.37559031365 , 1.0 ]
n= 500 Z [ 0.935671154717 , 1.16505319575 ]
Y [ 0.370600337649 , 1.0 ]
n= 1000 Z [ 1.19646327361 , 1.47897594053 ]
Y [ 0.359010408057 , 1.0 ]
n= 2000 Z [ 1.56890165007 , 1.96898566034 ]
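
One possible reading of these numbers, offered as a sketch and not as a confirmed diagnosis: hierarchy.average() documents its input as a condensed (1-D, upper-triangular) distance matrix, and a 2-D array is treated as a table of observation vectors, so Euclidean distances between the rows of Y would be computed first. Those distances are not bounded by 1 and tend to grow with the number of columns, which would match the Z ranges above. The sketch below compares both ways of calling it. It is written for Python 3 and a recent SciPy (unlike the Python 2 listing above), the helper name linkage_height_ranges and the seed are hypothetical, and taking 1 minus the rescaled correlation as the distance is an assumption (the listing above maps a correlation of 1 to the value 1, which behaves like a similarity).

```python
import numpy as np
import scipy.stats
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import squareform


def linkage_height_ranges(n, n_obs=300, seed=0):
    """Return (max merge height, square input), (max merge height, condensed input).

    Hypothetical helper for illustration; n should be at least 3 so that
    spearmanr returns a full correlation matrix.
    """
    rng = np.random.default_rng(seed)  # arbitrary seed, not the one in the question
    # Rows are variables (target words), columns are observations.
    X = rng.integers(0, 51, size=(n, n_obs))
    rho, _ = scipy.stats.spearmanr(X, axis=1)

    # Same rescaling as in the question: [-1, 1] -> [0, 1], diagonal equals 1.
    S = (rho + 1.0) / 2.0

    # As in the question: the square matrix is passed directly, so its rows
    # are treated as observation vectors and Euclidean distances are used.
    Z_square = hac.average(S)

    # Assumed alternative: turn the similarity into a distance (identical
    # rankings -> 0) and pass it in condensed form.
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)  # remove floating-point residue on the diagonal
    Z_condensed = hac.average(squareform(D, checks=False))

    # Column 2 of a linkage matrix holds the merge heights, which is what
    # the dendrogram's distance axis shows.
    return Z_square[:, 2].max(), Z_condensed[:, 2].max()


if __name__ == "__main__":
    for n in (50, 250):
        square_max, condensed_max = linkage_height_ranges(n)
        print("n =", n, "square input:", square_max, "condensed input:", condensed_max)
```

With the condensed input, every merge height under average linkage is a mean of entries of D, so it cannot exceed the largest entry of D. Reading Z[:, 2] is also a direct way to see which scale a given dendrogram uses before choosing a cut level.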

Why does the SciPy dendrogram distance axis scale change with the number of variables?

0 answers: