Question

I have already spent a week on this.

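I want to:

1. Run NMF topic modeling.
2. Assign a topic to each document by taking the topic with the largest weight.
3. Plot the distribution as a percentage bar chart with matplotlib (i.e. topics on the x-axis, and the % of documents assigned to each topic on the y-axis).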

Here is some toy data, with steps 1 and 2 already done:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pandas as pd

# Get data
data = {
    "Documents": ["I am a document",
                  "And me too",
                  "The cat is big",
                  "The dog is big",
                  "My headphones are large",
                  "My monitor has rabies",
                  "My headphones are loud",
                  "The street is loud "]
}

df = pd.DataFrame(data)

# Fit a TFIDF vectorizer 
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df['Documents'])

# Run NMF
nmf_model = NMF(n_components=4, random_state=1).fit(tfidf)

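# Document-topic weights (one row per document, one column per topic)
W = nmf_model.transform(tfidf)

# Topic-term matrix (one row per topic, one column per vocabulary term)
H = nmf_model.components_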

Now here is how I assign each document to a topic:

# Will return document topics as a list like [1, 3, 1, ...] to
# represent that the first document is topic 1, the second topic 3, and so on.
# Topic labels are the 0-based column indices of W.
topics = pd.DataFrame(W).idxmax(axis=1, skipna=True).tolist()

I should now be able to get what I want from these two structures, but I am at a loss.

Answer 1

This looks like a use case for Counter(). I would write it like this:

from collections import Counter

mylist = [1,1,1,1,2,2,3,1,1,2,3,1,1,1]
mycount = Counter(mylist)
for key,value in mycount.items():
    print(key,value)

This prints the number of documents per topic in the following form:

1 9
2 3
3 2
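To get from these counts to the percentages the question asks for, each count can be divided by the total number of documents. A minimal sketch, reusing the example list above; the names `total` and `percentages` are introduced here only for illustration:

```python
from collections import Counter

mylist = [1, 1, 1, 1, 2, 2, 3, 1, 1, 2, 3, 1, 1, 1]
mycount = Counter(mylist)

# Total number of documents (the length of the list).
total = sum(mycount.values())

# Map each topic to the percentage of documents assigned to it,
# ordered by topic label.
percentages = {
    topic: 100 * count / total
    for topic, count in sorted(mycount.items())
}

for topic, pct in percentages.items():
    print(f"topic {topic}: {pct:.1f}%")
```

The keys and values of `percentages` can then be passed to a bar plot as the x positions and bar heights.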

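One thing to keep in mind with latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) is that the whole point is that a document is a mixture of several topics. Assigning each document only to its highest-weighted topic may defeat that purpose. You may also want to think about how to handle nonsense sentences, because your approach currently assigns them to a topic automatically.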
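One possible way to act on both points is sketched below: normalize each row of the weight matrix so it reads as a topic mixture, average those mixtures instead of counting hard assignments, and leave a document unassigned when no topic clearly dominates. The matrix `W` here is made-up data standing in for `nmf_model.transform(tfidf)`, and the cutoff `min_share` and the label `-1` are arbitrary choices for illustration, not part of scikit-learn.

```python
import numpy as np

# Hypothetical document-topic weights (rows: documents, columns: topics),
# standing in for W = nmf_model.transform(tfidf).
W = np.array([
    [0.8, 0.2, 0.0],
    [0.3, 0.3, 0.4],
    [0.0, 0.0, 0.0],  # a document with no weight on any topic
    [0.1, 0.6, 0.3],
])

# Normalize each row to sum to 1; rows that sum to 0 stay all zeros.
row_sums = W.sum(axis=1, keepdims=True)
mix = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Soft share: mean topic proportion, in percent, over documents
# that carry any weight at all.
has_weight = row_sums.ravel() > 0
soft_share = 100 * mix[has_weight].mean(axis=0)

# Hard assignment with a cutoff: a document is labeled -1 ("unassigned")
# when its strongest topic holds less than min_share of its total weight.
min_share = 0.5  # illustrative threshold, to be tuned on real data
dominant = mix.argmax(axis=1)
labels = np.where(mix.max(axis=1) >= min_share, dominant, -1)

print(soft_share)
print(labels)
```

Whether the soft shares or the thresholded labels are more useful depends on what the chart is meant to show; both are only one reasonable reading of the concern above.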

Answer 2

IIUC, you want to plot a bar chart, so don't convert the topics to a list:

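import matplotlib.pyplot as plt

# Keep the result as a pandas Series (index: document number, value: topic label)
topics = pd.DataFrame(W).idxmax(axis=1, skipna=True)

# Note: this draws one bar per document (x is the document index)
plt.bar(x=topics.index, height=topics.mul(100)/topics.sum())
plt.show()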
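Since the call above uses the document index on the x-axis, it draws one bar per document. If the goal is one bar per topic showing the share of documents assigned to it, a sketch along the following lines may be closer. It relies on `Series.value_counts(normalize=True)`; the matrix `W` is hypothetical data standing in for `nmf_model.transform(tfidf)`, and the output file name is likewise only an example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical document-topic weights standing in for
# W = nmf_model.transform(tfidf): one row per document, one column per topic.
W = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.8, 0.0, 0.1],
    [0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.2],
    [0.5, 0.2, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.7],
    [0.6, 0.0, 0.3, 0.0],
])

# Dominant topic per document (column label of the largest weight in each row).
topics = pd.DataFrame(W).idxmax(axis=1)

# Percentage of documents per topic; reindex keeps topics that no
# document was assigned to, with a share of 0.
pct = (
    topics.value_counts(normalize=True)
    .reindex(range(W.shape[1]), fill_value=0.0)
    .mul(100)
)

fig, ax = plt.subplots()
ax.bar(pct.index.astype(str), pct.values)
ax.set_xlabel("Topic")
ax.set_ylabel("% of documents")
fig.savefig("topic_share.png")  # or plt.show() in an interactive session
```

With the real `W` from the question, only the hard-coded array needs to be dropped; the rest should carry over unchanged.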

This gives:

[Image: How to make a percentage bar chart of topics from topic modeling?]
