Question

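I am trying to run hierarchical clustering on Japanese words/terms and plot the result with scipy.cluster.hierarchy.dendrogram. However, the plot does not display the Japanese words/terms; it shows small rectangles instead. At first I thought this might be because the keys of the dictionary I created were Unicode escapes rather than Japanese characters (as described in an earlier question of mine). I was then advised to use Python 3 to solve that, and I ended up with dictionary keys that are Japanese words rather than Unicode escapes (as described in another question of mine). It turns out, however, that even when I pass the Japanese words/terms to the labels parameter of scipy.cluster.hierarchy.dendrogram, the plot still cannot display them. I checked several similar posts, but there still seems to be no clear solution. My code is as follows: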

import pandas as pd
import numpy as np
from sklearn import decomposition
from sklearn.cluster import AgglomerativeClustering as hicluster
from scipy.spatial.distance import cdist, pdist
from scipy import sparse as sp ## Sparse Matrix
from scipy.cluster.hierarchy import dendrogram
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')

## Import Data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", 
    encoding='CP932')

## Set X as CSR Sparse Matrix 
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

def plot_dendrogram(model, **kwargs):
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one
    # for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, 
        no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

dictlist = []
temp = []
akey = []
avalue = []

for key, value in dict_index.items():
    akey.append(key)
    avalue.append(value)
    temp = [key,value]
    dictlist.append(temp)

avalue = np.array(avalue)

X_transform = X[:, avalue < 1000].transpose().toarray()

freq1000terms = akey
freq1000terms = np.array(freq1000terms)[avalue < 1000]

hicl_ward = hicluster(n_clusters=40, linkage='ward',
                      compute_full_tree=False)
hiclwres = hicl_ward.fit(X_transform)

plt.rcParams["figure.figsize"] = (15,6)

model1 = hiclwres
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plot_dendrogram(model1, p=40, truncate_mode='lastp', orientation='top',
                labels=freq1000terms[model1.labels_], color_threshold=991)
plt.ylim(959,1000)
plt.show()

Answer 1

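You need to give matplotlib a valid font that can display Japanese characters. You can list the fonts available on your system with the following code: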

import matplotlib.font_manager
matplotlib.font_manager.findSystemFonts(fontpaths=None)

It returns a list of the system font files that matplotlib can use:

['c:\\windows\\fonts\\seguisli.ttf',
 'C:\\WINDOWS\\Fonts\\BOD_R.TTF',
 'C:\\WINDOWS\\Fonts\\GILC____.TTF',
 'c:\\windows\\fonts\\segoewp-light.ttf',
 'c:\\windows\\fonts\\glsnecb.ttf',
 ...
 ...
 'c:\\windows\\fonts\\elephnti.ttf',
 'C:\\WINDOWS\\Fonts\\COPRGTB.TTF']
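The raw list of file paths can be long, so it may help to check by family name instead. The following is a minimal sketch, not part of the original answer: the helper `pick_installed_font` and the candidate names in `JAPANESE_FONT_CANDIDATES` are assumptions chosen for illustration (common Japanese-capable fonts on Windows, macOS, and Linux), and the font on your machine may have a different name.

```python
# Hypothetical list of font families that usually contain Japanese glyphs.
# Adjust it to whatever is actually installed on your system.
JAPANESE_FONT_CANDIDATES = [
    "Yu Gothic", "Meiryo", "MS Gothic",    # typically Windows
    "Hiragino Sans",                       # typically macOS
    "Noto Sans CJK JP", "IPAexGothic",     # typically Linux
]


def pick_installed_font(candidates, installed=None):
    """Return the first candidate family that is installed, or None.

    If `installed` is not given, the family names known to matplotlib's
    font manager are used.
    """
    if installed is None:
        # Imported lazily so the helper also works with an explicit set.
        from matplotlib import font_manager
        installed = {font.name for font in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None
```

Calling `pick_installed_font(JAPANESE_FONT_CANDIDATES)` returns a family name that can be assigned to `plt.rcParams["font.family"]`, or `None` if none of the candidates is installed.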

Choose a font that contains Japanese glyphs and set it in matplotlib at the beginning of your code, as follows:

import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Yu Gothic"  # e.g. Yu Gothic, which includes Japanese glyphs
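To see the setting in the context of a dendrogram, here is a small self-contained sketch. It does not use the data from the question: the function name `plot_japanese_dendrogram`, the toy points, and the sample terms are all made up for illustration, and it assumes that a font named "Yu Gothic" is installed.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def plot_japanese_dendrogram(points, terms, family="Yu Gothic"):
    """Plot a Ward dendrogram whose leaf labels are Japanese terms.

    `family` is an assumption: replace it with a font that is installed
    on your system and contains Japanese glyphs.
    """
    # Global setting: every figure created afterwards uses this family.
    plt.rcParams["font.family"] = family
    fig, ax = plt.subplots(figsize=(6, 3))
    Z = linkage(points, method="ward")
    return dendrogram(Z, labels=terms, ax=ax)


# Toy data, for illustration only.
sample_terms = ["東京", "大阪", "京都", "名古屋"]
sample_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
```

Calling `plot_japanese_dendrogram(sample_points, sample_terms)` followed by `plt.show()` should draw the terms on the x-axis instead of empty rectangles, provided the chosen font really is available.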

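This is a global setting, so every other plot in the same project will use the same font family. If you want to change the font for a single text element only, you can use the font properties of the matplotlib text object.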
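A per-text variant might look like the sketch below, which changes only the leaf labels of the dendrogram and leaves the global font untouched. The function name `plot_with_label_font` is hypothetical, and "Yu Gothic" is again just an assumed family name.

```python
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from scipy.cluster.hierarchy import dendrogram, linkage


def plot_with_label_font(points, terms, family="Yu Gothic"):
    """Draw a dendrogram and apply `family` to its x tick labels only."""
    fig, ax = plt.subplots(figsize=(6, 3))
    dendrogram(linkage(points, method="ward"), labels=terms, ax=ax)
    prop = FontProperties(family=family)
    # The leaf labels are ordinary tick labels (Text objects) on the x-axis.
    for tick_label in ax.get_xticklabels():
        tick_label.set_fontproperties(prop)
    return ax
```

This works for the default `orientation='top'`; for a left or right orientation the leaf labels sit on the y-axis, so `ax.get_yticklabels()` would be the list to modify.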

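Also: if you cannot find a suitable font, you can download one such as Code2000, install it, and use it in the same way. (For a newly installed font to show up in the list, you may need to clear matplotlib's font cache.)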
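As an alternative to clearing the cache, recent matplotlib versions (3.2 and later) let you register a font file directly for the current session. The sketch below is an addition to the original answer; `register_font_file` is a hypothetical helper name, and the path you pass to it is whatever location you saved the downloaded font to.

```python
from matplotlib import font_manager


def register_font_file(path):
    """Register a font file with matplotlib and return its family name."""
    # Makes the font known to matplotlib for this session only.
    font_manager.fontManager.addfont(path)
    # Read the family name stored in the file, for use in rcParams.
    return font_manager.FontProperties(fname=path).get_name()
```

The returned name can then be assigned to `plt.rcParams["font.family"]` in the same way as a system font.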

How to make a SciPy dendrogram display Japanese words/terms
