Explanation of a statement in topic modeling

Time: 2017-09-16 16:00:54

Tags: python python-3.x text-mining lda topic-modeling

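I have a question about topic modeling (LDA).

I don't fully understand how topic modeling works, so this question may seem strange.

Does the following statement pick topics at random, or does it pick the ones with high frequency (probability)?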

test = ranking[:5]

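What exactly does this statement mean?

My code produces as many topics as there are documents (I have heard that the number of documents cannot be reduced). I extract only some of them. Some people say these are the representative ones and others say they are the high-frequency ones, but I don't understand the principle behind it.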

import os

import numpy as np
import sklearn.feature_extraction.text as text

from sklearn import decomposition

CORPUS_PATH = os.path.join('data', 'test')
filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in 
os.listdir(CORPUS_PATH)])

len(filenames)
filenames[:5]
print(filenames)

vectorizer = text.CountVectorizer(input='filename', stop_words='english', 
min_df=20, encoding='iso-8859-1')
dtm = vectorizer.fit_transform(filenames).toarray()
# get_feature_names() was removed in scikit-learn 1.2; on 1.0+ use get_feature_names_out()
vocab = np.array(vectorizer.get_feature_names_out())
dtm.shape
aaa = len(vocab)

num_topics = 20
num_top_words = 20
clf = decomposition.NMF(n_components = num_topics, random_state=1)

doctopic = clf.fit_transform(dtm)

#print words associated with topics
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

novel_names = []
for fn in filenames:
    basename = os.path.basename(fn)
    name, ext = os.path.splitext(basename)
    name = name.rstrip('0123456789')
    novel_names.append(name)

novel_names = np.asarray(novel_names)
doctopic_orig = doctopic.copy()

num_groups = len(set(novel_names))

doctopic_grouped = np.zeros((num_groups, num_topics))

for i, name in enumerate(sorted(set(novel_names))):
    doctopic_grouped[i, :] = np.mean(doctopic[novel_names == name, :], axis=0)

doctopic = doctopic_grouped

novels = sorted(set(novel_names))
print("Top NMF topics in...")
for i in range(len(doctopic)):
    top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
    top_topics_str = ' '.join(str(t) for t in top_topics)
    print("{}: {}".format(novels[i], top_topics_str))

for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))

austen_indices, cbronte_indices = [], []
for index, fn in enumerate(sorted(set(novel_names))):
    if "Austen" in fn:
        austen_indices.append(index)
    elif "CBronte" in fn:
        cbronte_indices.append(index)

austen_avg = np.mean(doctopic[austen_indices, :], axis=0)
cbronte_avg = np.mean(doctopic[cbronte_indices, :], axis=0)
keyness = np.abs(austen_avg - cbronte_avg)
ranking = np.argsort(keyness)[::-1]
test = ranking[:5]

print(test)

1 answer:

Answer 0 (score: 1)

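ranking[:5] is called a slice. It is equivalent to ranking[0:5] and takes the first 5 elements of ranking. For a plain Python list a slice is a copy of the sublist; here ranking is a NumPy array returned by np.argsort, so the slice is a view onto the same data rather than a copy. This is explained in more detail here. (Look in the table, especially footnote 4.)

So the selection is neither random nor based on frequency. In your code keyness is the absolute difference between the mean topic shares of the Austen and CBronte groups, and np.argsort(keyness)[::-1] orders the topic indices from the largest difference to the smallest. ranking[:5] is therefore the indices of the 5 topics that differ most between the two authors.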
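
To make the slicing step concrete, here is a minimal sketch that repeats the last lines of your script on a tiny example. The numbers in austen_avg and cbronte_avg are made up for illustration (4 topics instead of 20) and are not taken from your corpus; the variable names top2 and as_list are also just for this example.

```python
import numpy as np

# Hypothetical mean topic shares per author (made-up values, 4 topics).
austen_avg = np.array([0.10, 0.40, 0.30, 0.20])
cbronte_avg = np.array([0.35, 0.38, 0.07, 0.20])

# Absolute difference per topic: a large value means the authors differ a lot.
keyness = np.abs(austen_avg - cbronte_avg)

# argsort gives indices in ascending order of keyness; [::-1] reverses that,
# so the topic with the largest difference comes first.
ranking = np.argsort(keyness)[::-1]

# Slice: the first 2 entries, i.e. the 2 most distinguishing topic indices.
top2 = ranking[:2]
print(ranking)
print(top2)

# Slicing a NumPy array gives a view, slicing a list gives a copy.
print(np.shares_memory(top2, ranking))
as_list = ranking.tolist()
list_slice = as_list[:2]
list_slice[0] = -1          # changes only the copy, as_list is untouched
print(as_list[0])
```

With these made-up values the largest differences are in topics 0 and 2, so top2 holds those two indices. In your script the same logic applies with 20 topics and the first 5 positions.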