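I have a two-column CSV file that I load with pandas. I run the LDA algorithm on the data in the second column and then print the documents that belong to each topic. Everything works, and my output shows the topics with their documents (second column only). However, I would like the output for each topic to include the first column as well as the second.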
import pandas
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
n_components = 2
n_top_words = 5
def print_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        message = "%d " % topic_idx  # this is what has to change to fix the output
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        out_list.append(message.split())
    return out_list
text = pandas.read_csv('listes.csv', encoding = 'utf-8')
text_liste2 = text['liste2']
text_liste1 = text['liste1']
text_liste1_list = text_liste1.values.tolist()
text_liste2_list = text_liste2.values.tolist()
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_liste2_list)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
# print docs per topic (works); fit_transform fits the model, so no separate lda.fit(tf) is needed
document_topics = lda.fit_transform(tf)
topicos = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topicos)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:3]:
        print(" ".join(text_liste2_list[j].split(",")[:2]))
Data:
liste1,liste2
'hello, how are you','hello'
'I am super intelligent','super intelligent'
'He is a great friend','great friend'
'THE book is on the table','book table'
'the EARTH is in danger','earth danger'
'I just can say goodbye','just goodbye'
'she eats bananas','eats bananas'
'you say goodbye','say goodbye'
My output:
Topic 0:
book table
earth danger
just goodbye
eats bananas
Topic 1:
hello
super intelligent
great friend
say goodbye
Desired output:
Topic 0:
'THE book is on the table','book table'
'the EARTH is in danger','earth danger'
'I just can say goodbye','just goodbye'
'she eats bananas','eats bananas'
Topic 1:
'hello, how are you','hello'
'I am super intelligent','super intelligent'
'He is a great friend','great friend'
'you say goodbye','say goodbye'
Answer 0 (score: 2)
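First, remove the comma from the first row ('hello, how are you'). The fields are wrapped in single quotes, which read_csv does not treat as quote characters by default, so that comma is read as a column separator.
Second, just print text_liste1_list[j] as well in the last print :-):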
for j in docs[:3]:
    str2 = " ".join(text_liste2_list[j].split(",")[:2])
    print(text_liste1_list[j] + ',' + str2)
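As an alternative to editing the data by hand, the single quote can be declared as the quote character when parsing. The sketch below is only an illustration under stated assumptions: it uses the standard-library csv module instead of pandas (pandas.read_csv accepts a quotechar argument as well), the helper names read_rows and top_rows_per_topic are made up for this example, and doc_topic holds invented weights standing in for the result of lda.fit_transform(tf). It shows that the same row index j picks out both columns.

```python
import csv
import io

# Three rows from the question's sample data, quoted with single quotes.
SAMPLE = """liste1,liste2
'hello, how are you','hello'
'THE book is on the table','book table'
'the EARTH is in danger','earth danger'
"""


def read_rows(text, quotechar="'"):
    """Parse CSV text into a list of {column name: value} dicts."""
    return list(csv.DictReader(io.StringIO(text), quotechar=quotechar))


def top_rows_per_topic(rows, doc_topic, n_docs):
    """For each topic (a column of doc_topic), return the n_docs rows
    with the largest weight, highest first."""
    n_topics = len(doc_topic[0])
    result = []
    for topic in range(n_topics):
        ranked = sorted(range(len(rows)),
                        key=lambda j: doc_topic[j][topic],
                        reverse=True)
        result.append([rows[j] for j in ranked[:n_docs]])
    return result


rows = read_rows(SAMPLE)

# Hypothetical document-topic weights (one row per document, one column
# per topic); in the question these would come from lda.fit_transform(tf).
doc_topic = [[0.2, 0.8], [0.9, 0.1], [0.7, 0.3]]

for topic, best in enumerate(top_rows_per_topic(rows, doc_topic, n_docs=2)):
    print("Topic {}:".format(topic))
    for row in best:
        # Both columns come from the same row, so they stay paired.
        print("{},{}".format(row["liste1"], row["liste2"]))
```

Keeping each row together as one record, rather than as two parallel lists, means the first column can never drift out of step with the second when the documents are reordered by topic weight.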