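LDA topic modeling input data

Time: 2015-08-17 16:13:50

Tags: python twitter lda topic-modeling

I'm new to Python. I have just started a project that applies LDA topic modeling to tweets, and I'm trying the code below.

That example uses an online dataset. I have a csv file containing the tweets I need to use. Can anyone tell me how to use my local file instead? How do I build my own vocab and titles?

I couldn't find a tutorial that explains how to prepare the input for LDA. They all assume you already know how to do it.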



from __future__ import division, print_function

import numpy as np
import lda
import lda.datasets


# document-term matrix

X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))


doc_id = 0
word_id = 3117

print("doc id: {} word id: {}".format(doc_id, word_id))
print("-- count: {}".format(X[doc_id, word_id]))
print("-- word : {}".format(vocab[word_id]))
print("-- doc  : {}".format(titles[doc_id]))


model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)


topic_word = model.topic_word_ 
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))


for n in range(5):
    sum_pr = sum(topic_word[n,:])
    print("topic: {} sum: {}".format(n, sum_pr))


n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))


doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))




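1 Answer:

Answer 0 (score: 5)

I know this is a bit late, but hopefully it helps. The first thing to understand is that LDA takes a DTM (document-term matrix) as input: a matrix of word counts with one row per document and one column per vocabulary word. So I suggest the following steps: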

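  1. Load your csv file
  2. Extract the tweets you need from the file
  3. Clean the data
  4. Create a dictionary containing every word of the resulting corpus
  5. Build the DTM structure
  6. Fit the structure to your data
  7. Get the vocabulary, i.e. the features (words) of the DTM
  8. Continue with the code from the question

Here is some code to help you get started.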
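The code in this answer reads its tweets from a list named `txt1` but does not show where that list comes from. As a minimal sketch of steps 1 to 3, assuming a CSV with a header row and the tweet text in a column called `text` (the file name `tweets.csv`, the column name, the sample rows and the cleaning rules are all illustrative assumptions, not part of the original answer), the list could be built with the standard library alone:

```python
import csv
import io
import re

# Stand-in for a local file so the sketch runs as-is. With a real file, use
# open("tweets.csv", newline="", encoding="utf-8") instead (hypothetical name).
SAMPLE_CSV = io.StringIO(
    "id,text\n"
    "1,Loving the new phone! http://t.co/abc #tech\n"
    '2,"@bob the match was great, what a goal"\n'
    "3,RT @news: Markets fall again today\n"
)


def clean_tweet(text):
    """Lower-case a tweet and strip URLs, @mentions, the RT marker and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # @mentions
    text = re.sub(r"\brt\b", " ", text)        # retweet marker
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, '#', punctuation
    return " ".join(text.split())              # collapse runs of whitespace


def load_tweets(handle, column="text"):
    """Return the cleaned, non-empty tweets found in `column` of a CSV file."""
    reader = csv.DictReader(handle)
    cleaned = (clean_tweet(row[column]) for row in reader)
    return [tweet for tweet in cleaned if tweet]


txt1 = load_tweets(SAMPLE_CSV)
print(txt1)
```

How aggressively to clean is a judgment call. Dropping the `#` but keeping the hashtag word, as done here, is only one option.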

    import lda
    from sklearn.feature_extraction.text import CountVectorizer

    # txt1 is assumed to be a list with one cleaned tweet per element
    token_dict = {}
    for i in range(len(txt1)):
        token_dict[i] = txt1[i]

    print("Number of tweets:", len(token_dict))

    print("\n Build DTM")
    tf = CountVectorizer(stop_words='english')

    print("\n Fit DTM")
    tfs1 = tf.fit_transform(token_dict.values())

    # set the number of topics to look for
    num = 8

    model = lda.LDA(n_topics=num, n_iter=500, random_state=1)

    # we fit the DTM (raw counts), not TF-IDF weights, to LDA
    print("\n Fit LDA to data set")
    doc_topic = model.fit_transform(tfs1)

    print("\n Obtain the words with high probabilities")
    topic_word = model.topic_word_  # model.components_ also works

    print("\n Obtain the feature names")
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    vocab = tf.get_feature_names_out()
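To tie this back to the question: the `X`, `vocab` and `titles` that the Reuters example loads can all be derived from your own tweets. The sketch below builds them by hand on three made-up tweets, which is roughly what `CountVectorizer` does (without stop-word removal, and without its default of ignoring one-letter tokens), so that the shape of each object is visible. Using the first four words of a tweet as its title is an assumption made here for illustration; the answer does not say what to use as titles.

```python
from collections import Counter

# Made-up cleaned tweets standing in for txt1
tweets = [
    "markets fall again today",
    "the match was great what a goal",
    "great goal great match",
]

# Tokenize on whitespace (a simplification of what CountVectorizer does)
tokenized = [tweet.split() for tweet in tweets]

# vocab: every distinct word, sorted so each word gets a fixed column index
vocab = sorted({word for tokens in tokenized for word in tokens})
column = {word: j for j, word in enumerate(vocab)}

# X: one row per tweet, one column per vocab word, each cell a raw count
X = []
for tokens in tokenized:
    row = [0] * len(vocab)
    for word, n in Counter(tokens).items():
        row[column[word]] = n
    X.append(row)

# titles: a short label per document (first four words, an arbitrary choice)
titles = [" ".join(tokens[:4]) for tokens in tokenized]

# Same kind of lookup as in the question's code
doc_id = 2
word_id = column["great"]
print("-- count: {}".format(X[doc_id][word_id]))
print("-- word : {}".format(vocab[word_id]))
print("-- doc  : {}".format(titles[doc_id]))
```

A list of lists like this would need converting to a NumPy integer array, for example `np.array(X)`, before being passed to `model.fit`. For thousands of tweets the sparse matrix returned by `CountVectorizer` is the more practical choice.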