Question

import pandas as pd, numpy as np, scipy
import sklearn.feature_extraction.text as text
from sklearn import decomposition

descs = ["You should not go there", "We may go home later", "Why should we do your chores", "What should we do"]

vectorizer = text.CountVectorizer()

dtm = vectorizer.fit_transform(descs).toarray()

vocab = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

nmf = decomposition.NMF(3, random_state = 1)

topic = nmf.fit_transform(dtm)

Printing topic gives me:

>>> print(topic)
[0.       , 1.403    , 0.     ],
[0.       , 0.       , 1.637  ],
[1.257    , 0.       , 0.     ],
[0.874    , 0.056    , 0.065  ]

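The vector for each element of descs is the likelihood of that element belonging to each cluster. How can I get the coordinates of the centroid of each cluster? Ultimately, I want to write a function that calculates the distance of each element in descs from the centroid of the cluster it was assigned to.

Would it be best to just compute the average of the topic values of the descs elements in each cluster?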

Answer 1

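The documentation of sklearn.decomposition.NMF explains how to get the coordinates of the centroid of each cluster:

Attributes: components_ : array, [n_components, n_features]
Non-negative components of the data.

The basis vectors are arranged row-wise, as shown in the following interactive session: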

In [995]: np.set_printoptions(precision=2)

In [996]: nmf.components_
Out[996]: 
array([[ 0.54,  0.91,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.89,  0.  ,  0.89,  0.37,  0.54,  0.  ,  0.54],
       [ 0.  ,  0.01,  0.71,  0.  ,  0.  ,  0.  ,  0.71,  0.72,  0.71,  0.01,  0.02,  0.  ,  0.71,  0.  ],
       [ 0.  ,  0.01,  0.61,  0.61,  0.61,  0.61,  0.  ,  0.  ,  0.  ,  0.62,  0.02,  0.  ,  0.  ,  0.  ]])

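Regarding your second question, I don't see the point of "computing the average of the topic values of each descs element for each cluster". In my opinion, it makes more sense to classify each element by its computed likelihoods.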
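
For the distance function asked about in the question, one possible sketch is below. It assumes each element is assigned to the cluster with the largest value in its row of topic, and that the "distance" wanted is the Euclidean distance between the element's word-count vector and the matching row of components_. The function name distances_to_assigned_centroid and the normalize flag are hypothetical, not part of scikit-learn. Because the rows of components_ are not on the same scale as the raw counts, normalize=True compares unit-length vectors instead.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer


def distances_to_assigned_centroid(dtm, topic, components, normalize=False):
    """Return (labels, distances) for every row of dtm.

    labels[i] is the column with the largest value in topic[i];
    distances[i] is the Euclidean distance from dtm[i] to components[labels[i]].
    """
    labels = topic.argmax(axis=1)
    docs = np.asarray(dtm, dtype=float)
    assigned = np.asarray(components, dtype=float)[labels]
    if normalize:
        # Assumes no all-zero rows; otherwise this divides by zero.
        docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        assigned = assigned / np.linalg.norm(assigned, axis=1, keepdims=True)
    return labels, np.linalg.norm(docs - assigned, axis=1)


descs = ["You should not go there", "We may go home later",
         "Why should we do your chores", "What should we do"]
dtm = CountVectorizer().fit_transform(descs).toarray()

# max_iter is raised here as an assumption, to give the solver room to converge.
nmf = NMF(n_components=3, random_state=1, max_iter=1000)
topic = nmf.fit_transform(dtm)

labels, dists = distances_to_assigned_centroid(dtm, topic, nmf.components_)
for desc, label, dist in zip(descs, labels, dists):
    print(f"{desc!r}: cluster {label}, distance {dist:.3f}")
```

If a scale-independent comparison is preferred, pass normalize=True; whether that or the raw distance is more useful depends on what the distances are meant to measure.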
