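I have an algorithm for analyzing word distributions. It builds an observation vector for each target word and, from that table, computes distances with stats.spearmanr() (rescaled from [-1, 1] to [0, 1]), producing a distance matrix (Y). I then use hierarchy.average() to obtain the clustering (Z). Finally, a dendrogram is generated and plotted.
My problem: the scale of the dendrogram changes with the number of target words. I assumed its distance axis would span [0, 1], the range of the rescaled spearmanr() values described above. But for 50 words it is [0, 0.5], for 150 words it is [0, 1], and for 1000 words it is [0, 2].
Why does this happen (the distance scale has values larger than those in Y)?
I would welcome any ideas on this, since I cannot find any hints in the documentation or on the web (which makes me worry that I am asking the wrong question...). Also, I need either a fixed scale or at least a way to know which scale the dendrogram is using, so that I can specify cut levels. Thanks in advance for any help.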
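To make the "which scale is the dendrogram using" part concrete, here is a minimal sketch of how the scale can be read directly from the linkage matrix: the third column of Z holds the merge heights that the dendrogram draws. The toy data (`points`) and the use of `pdist` are assumptions for illustration only, not my real word vectors.

```python
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

# Hypothetical toy data, used only to have something to cluster.
rng = np.random.RandomState(0)
points = rng.rand(20, 3)

# 1-D condensed distance vector (upper triangle of the distance matrix).
condensed = pdist(points)

# Average-linkage clustering on the condensed distances.
Z = hac.average(condensed)

# Column 2 of Z holds the merge heights, i.e. the dendrogram's distance axis.
heights = Z[:, 2]
print('input distances [', condensed.min(), ',', condensed.max(), ']')
print('merge heights   [', heights.min(), ',', heights.max(), ']')
```

With average linkage, each merge height is a mean of input distances, so in this sketch the heights stay inside the range of the input distances. That is the behaviour I expected for Y as well.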
Simplified code:
# coding: utf-8
# Statistics
import random
import numpy as np
import scipy.stats
# Clustering and dendrogram visualization
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


def remap(x, in_min, in_max, out_min, out_max):
    # Linearly rescale x from [in_min, in_max] to [out_min, out_max]
    return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min


random.seed('7622')
sizes = [50, 250, 500, 1000, 2000]
for n in sizes:
    # Generate observation matrix
    X = []
    for i in range(n):
        vet = []
        for j in range(300):
            # Generate random observations
            vet.append(random.randint(0, 50))
        X.append(vet)
    # X is a matrix where rows are variables (target words) and columns are observations (contexts of occurrence)
    Y = scipy.stats.spearmanr(X, axis=1)
    # Y rescaling: Y[0] is the correlation matrix, mapped from [-1, 1] to [0, 1]
    for i in range(len(Y[0])):
        Y[0][i] = [remap(v, -1, 1, 0, 1) for v in Y[0][i]]
    print('Y [', Y[0].min(), ',', Y[0].max(), ']')
    # Clustering
    Z = hac.average(Y[0])
    print('n=', n,
          'Z [', Z[:, 2].min(), ',', Z[:, 2].max(), ']')
[UPDATE] Results of the code above:
Y [ 0.401120498124 , 1.0 ]
n= 50 Z [ 0.634408300876 , 0.77633631869 ]
Y [ 0.379375733574 , 1.0 ]
n= 250 Z [ 0.775241869849 , 0.969704246048 ]
Y [ 0.37559031365 , 1.0 ]
n= 500 Z [ 0.935671154717 , 1.16505319575 ]
Y [ 0.370600337649 , 1.0 ]
n= 1000 Z [ 1.19646327361 , 1.47897594053 ]
Y [ 0.359010408057 , 1.0 ]
n= 2000 Z [ 1.56890165007 , 1.96898566034 ]
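One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the SciPy documentation for the linkage functions says they expect a 1-D condensed distance matrix, and that a 2-D array is treated as a set of observation vectors. The comparison below passes the same matrix both ways. Note that the dissimilarity `(1 - rho) / 2` is my assumption for this sketch (it puts 0 on the diagonal, which the condensed form needs) and is not the same mapping as `remap` in the code above.

```python
import numpy as np
import scipy.stats
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(7622)
# Rows are variables (target words), columns are observations (contexts).
X = rng.randint(0, 51, size=(50, 300))

# 50 x 50 Spearman correlation matrix between rows.
rho, _ = scipy.stats.spearmanr(X, axis=1)

# ASSUMPTION: dissimilarity in [0, 1] with 0 for identical rankings.
D = (1.0 - rho) / 2.0
D = (D + D.T) / 2.0          # enforce exact symmetry
np.fill_diagonal(D, 0.0)     # enforce an exactly zero diagonal

# Condensed (1-D) form of the square matrix.
condensed = squareform(D, checks=False)

# Square input: rows of D are treated as observation vectors
# (recent SciPy versions may emit a ClusterWarning here).
Z_square = hac.average(D)
# Same thing written out explicitly: Euclidean distances between rows of D.
Z_rows = hac.average(pdist(D))
# Condensed input: the values in D are used as the distances.
Z_condensed = hac.average(condensed)

print('D           [', condensed.min(), ',', condensed.max(), ']')
print('Z square    [', Z_square[:, 2].min(), ',', Z_square[:, 2].max(), ']')
print('Z condensed [', Z_condensed[:, 2].min(), ',', Z_condensed[:, 2].max(), ']')
```

If the square and condensed heights differ for the real data in the same way, that would suggest the dendrogram axis reflects Euclidean distances between rows of Y rather than the values in Y themselves.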