Question

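I have a two-column CSV file that I load with pandas. I run the LDA algorithm on the data in the second column and then print the documents that belong to each topic. That part works: my output lists each topic with its documents, but only from the second column. I would like each topic's output to also include the matching value from the first column.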

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_components = 2
n_top_words = 5

def print_top_words(model, feature_names, n_top_words):
    # Build one list per topic: the topic index followed by its top words.
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        message = "%d " % topic_idx  # this is the part that has to change to fix the output
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

        out_list.append(message.split())
    return out_list

text = pd.read_csv('listes.csv', encoding='utf-8')
text_liste2 = text['liste2']
text_liste1 = text['liste1']
text_liste1_list = text_liste1.values.tolist()
text_liste2_list = text_liste2.values.tolist()

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_liste2_list)
# On scikit-learn older than 1.0, use get_feature_names() instead.
tf_feature_names = tf_vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)

# print docs per topic - this works
# fit_transform fits the model and returns the document-topic matrix,
# so a separate lda.fit(tf) call beforehand is not needed.
document_topics = lda.fit_transform(tf)
topicos = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topicos)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:3]:
        print(" ".join(text_liste2_list[j].split(",")[:2]))

Data

liste1,liste2
'hello, how are you','hello'
'I am super intelligent','super intelligent'
'He is a great friend','great friend'
'THE book is on the table','book table'
'the EARTH is in danger','earth danger'
'I just can say goodbye','just goodbye' 
'she eats bananas','eats bananas'
'you say goodbye','say goodbye'

My output:

Topic 0:

book table
earth danger
just goodbye 
eats bananas

Topic 1:

hello
super intelligent
great friend
say goodbye

Desired output:

Topic 0:
'THE book is on the table','book table'
'the EARTH is in danger','earth danger'
'I just can say goodbye','just goodbye' 
'she eats bananas','eats bananas'

Topic 1:
'hello, how are you','hello'
'I am super intelligent','super intelligent'
'He is a great friend','great friend'
'you say goodbye','say goodbye'

Answer 1

First, remove the comma from 'hello, how are you' in the first row of the data. Second, just print text_liste1_list[j] as well in the last print :-):

for j in docs[:3]:
    # j indexes both lists, so the same position gives the matching first-column value
    str2 = " ".join(text_liste2_list[j].split(",")[:2])
    print(text_liste1_list[j] + ',' + str2)
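
If you would rather not edit the data, one possible alternative is to let pandas handle the single quotes through the quotechar argument of read_csv, and then look up both columns by the same row position. The sketch below is only an illustration under assumptions: the data is an inline copy of a few rows from the question, the document_topics matrix is made up to stand in for the result of lda.fit_transform(tf), and top_rows_per_topic is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of a few rows from the question's data (stand-in for listes.csv).
csv_data = """liste1,liste2
'hello, how are you','hello'
'I am super intelligent','super intelligent'
'He is a great friend','great friend'
'THE book is on the table','book table'
"""

# quotechar="'" makes pandas treat single quotes as field quotes,
# so the comma inside 'hello, how are you' no longer splits the field.
df = pd.read_csv(io.StringIO(csv_data), quotechar="'")

# Made-up document-topic weights, one row per document (assumption);
# in the question this matrix comes from lda.fit_transform(tf).
document_topics = np.array([
    [0.2, 0.8],
    [0.3, 0.7],
    [0.4, 0.6],
    [0.9, 0.1],
])


def top_rows_per_topic(df, document_topics, n_docs=3):
    """For each topic, return the n_docs highest-weighted rows as 'liste1,liste2' strings."""
    result = []
    for topic_idx in range(document_topics.shape[1]):
        # Row positions sorted by descending weight for this topic.
        docs = np.argsort(document_topics[:, topic_idx])[::-1][:n_docs]
        rows = df.iloc[docs]
        # Both columns come from the same rows, so they stay aligned.
        result.append([f"{a},{b}" for a, b in zip(rows["liste1"], rows["liste2"])])
    return result


for topic_idx, lines in enumerate(top_rows_per_topic(df, document_topics)):
    print("Topic {}:".format(topic_idx))
    for line in lines:
        print(line)
```

Selecting rows with df.iloc keeps the two columns together, so there is no need to maintain two parallel lists. Whether quotechar fits your real file depends on how consistently it is quoted, so treat this as a starting point rather than a drop-in fix.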
