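I am trying to print the topics, and the texts belonging to each topic, from an LDA model. However, a None that shows up after the topics are printed breaks my script. I can print the topics, but not the texts.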
import pandas
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
n_top_words = 5
n_components = 5
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        return message
text = pandas.read_csv('text.csv', encoding = 'utf-8')
text_list = text.values.tolist()
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_list)
lda = LatentDirichletAllocation(n_components=n_components, learning_method='batch', max_iter=25, random_state=0)
doc_distr = lda.fit_transform(tf)
tf_feature_names = tf_vectorizer.get_feature_names()
print (print_top_words(lda, tf_feature_names, n_top_words))
doc_distr = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topics)):
    print ("Topic {}:".format(i))
    docs = np.argsort(doc_distr[:, i])[::-1]
    for j in docs[:10]:
        print (" ".join(text_list[j].split(",")[:2]))
My output:
Topic 0: no order mail received back
Topic 1: cancel order wishes possible wish
Topic 2: keep current informed delivery order
Topic 3: faulty wooden box present side
Topic 4: delivered received be produced urgent
Topic 5: good waiting day response share
Followed by this error:
File "lda.py", line 41, in <module>
for i in range(len(topics)):
TypeError: object of type 'NoneType' has no len()
Answer 0 (score: 2)
Your print_top_words() function has (at least) four problems.
The first one, which causes your current error, is that if model.components_ is empty, the body of the for loop never executes and your function then (implicitly) returns None.
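The second is more subtle: if model.components_ is not empty, the function returns only the first message and then exits. That is the definition of the return statement: it returns a value (None if no value is specified) and exits the function.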
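The first two problems can be reproduced without scikit-learn. The following is a minimal sketch, not the asker's code: `first_only` and `collect_all` are hypothetical names chosen for illustration, and plain lists stand in for `model.components_`.

```python
def first_only(items):
    # Same shape as the function in the question: the only return
    # statement sits inside the loop body.
    for item in items:
        return item  # exits the function during the first iteration
    # Falling off the end of a function returns None implicitly.


def collect_all(items):
    collected = []
    for item in items:
        collected.append(item)
    return collected  # runs once, after the loop has finished


non_empty_result = first_only(["a", "b", "c"])
empty_result = first_only([])
all_results = collect_all(["a", "b", "c"])

print(non_empty_result)  # a
print(empty_result)      # None
print(all_results)       # ['a', 'b', 'c']
```

Moving the return out of the loop, and accumulating into a list first, covers both cases: an empty input gives an empty list rather than None.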
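The third problem is that (when model.components_ is not empty) the function returns a string where the calling code clearly expects a list. This is a subtle bug because strings have a length, so the for loop over range(len(topics)) appears to work, but len(topics) is certainly not the value you expect.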
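To make the string-versus-list point concrete, here is a small sketch using two of the topic lines from the question's output. The variable names are illustrative only.

```python
topics_as_string = "Topic #0: no order mail received back"
topics_as_list = [
    "Topic #0: no order mail received back",
    "Topic #1: cancel order wishes possible wish",
]

string_length = len(topics_as_string)  # counts characters
list_length = len(topics_as_list)      # counts topics

print(string_length, list_length)

# Iterating over range(len(...)) of the string would therefore visit one
# index per character, not one index per topic.
```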
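Finally, the function is badly misnamed, since it does not "print" anything. Compared with the first three problems this may seem trivial, and it will not stop the code from working (assuming the first three problems are fixed), but reasoning about code is hard enough as it is, so correct naming matters: it greatly reduces cognitive load and makes maintenance and debugging easier.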
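Long story short: think about what you really want this function to do and fix it accordingly. Since I am not sure what you are trying to do, I will not post a "corrected" version here, but the explanations above should help.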
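For readers who want a starting point, here is one possible reading of the intent, offered only as a sketch: the function returns a list with one string per topic, and the caller does the printing. The name `get_top_words` is hypothetical, and `fake_model` is a stand-in object with a `components_` attribute so the example runs without fitting anything; a fitted LatentDirichletAllocation instance is assumed to work in its place.

```python
from types import SimpleNamespace


def get_top_words(model, feature_names, n_top_words):
    """Return one 'Topic #k: w1 w2 ...' string per topic ([] if no topics)."""
    messages = []
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, highest first.
        top = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
        top = top[:n_top_words]
        words = " ".join(feature_names[i] for i in top)
        messages.append("Topic #%d: %s" % (topic_idx, words))
    return messages  # outside the loop, so every topic is included


# Hypothetical stand-in for a fitted model: two topics over four words.
fake_model = SimpleNamespace(
    components_=[[0.1, 2.0, 0.5, 1.0], [3.0, 0.2, 0.1, 4.0]]
)
words = ["order", "cancel", "mail", "delivery"]

topics = get_top_words(fake_model, words, 2)
for line in topics:
    print(line)
```

With this shape, len(topics) is the number of topics, so a loop over range(len(topics)) visits each topic exactly once.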
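NB: you also call lda.fit_transform(tf) and print_top_words(lda, tf_feature_names, n_top_words) twice with exactly the same arguments. This is either completely useless and a pure waste of processor cycles (best case), or a smell pointing to another bug if the second calls give you different results.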
Answer 1 (score: 1)
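You did not provide the complete code, but the most likely cause is that the variable topics is None. The only way that can happen is if model.components_ inside the print_top_words function is an empty collection: the loop then never runs and the function (implicitly) returns None. Check the value of that collection. Better yet, choose the value you want to return in that case.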
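Another, unrelated point: you initialize the message variable on every iteration and then return it on every iteration. Check that this is what you mean.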
Answer 2 (score: 1)
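This is hard to answer without knowing the inner workings of LatentDirichletAllocation. However, it has something to do with components_, since iterating over it repeatedly produces different results.
You can most likely avoid this error by changing the following: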
print (print_top_words(lda, tf_feature_names, n_top_words))
doc_distr = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
to:
temp = print_top_words(lda, tf_feature_names, n_top_words)
print (temp)
doc_distr = lda.fit_transform(tf)
topics = temp  # reuse the stored result; print_top_words(temp) would fail for lack of arguments
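The second time the function is called, model.components_ yields nothing, so the loop is skipped and the function returns nothing.
However, I am not sure this is the actual intent of the code. It looks like you might want print_top_words to be a generator? You return inside the for loop, so it never reaches a second iteration, which is probably not the purpose of the loop.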
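If a generator is indeed what was intended, a minimal sketch could look like the following. The name `iter_top_words` and the `fake_model` stand-in are assumptions for illustration, not part of the question's code or of scikit-learn.

```python
from types import SimpleNamespace


def iter_top_words(model, feature_names, n_top_words):
    """Yield (topic index, list of top words), one pair per topic."""
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, highest first.
        top = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
        # yield hands back one result and resumes here on the next
        # request, so the loop reaches every topic.
        yield topic_idx, [feature_names[i] for i in top[:n_top_words]]


# Hypothetical stand-in for a fitted model: two topics over four words.
fake_model = SimpleNamespace(
    components_=[[0.1, 2.0, 0.5, 1.0], [3.0, 0.2, 0.1, 4.0]]
)
words = ["order", "cancel", "mail", "delivery"]

# Materialize the generator once; a generator can only be consumed one time.
topic_words = list(iter_top_words(fake_model, words, 2))
for topic_idx, top_words in topic_words:
    print("Topic #%d: %s" % (topic_idx, " ".join(top_words)))
```

Because a generator is exhausted after one pass, storing the result in a list avoids the need to call the function a second time.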