How to convert Spark LDA modeling numeric output back to the original words

Date: 2016-06-18 23:20:55

Tags: python apache-spark lda

I am trying to learn Spark LDA modeling in Python, with the goal of extracting topics from several hundred Reddit posts.

However, after training the model, when I call describeTopics() the result looks like this:


[([441832, 8563, 731824, 381507, 933925], [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947, 0.00503830331115649, 0.004271026596928107]), ...]
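For reference, each element of this output pairs a list of term indices with a list of term weights (this reading is an assumption consistent with the shape shown above; the `topics` variable below is a hypothetical copy of the first tuple, not the model's full output). A minimal sketch of unpacking it:

```python
# Hypothetical sample mirroring the first tuple from describeTopics():
# (term indices, term weights), highest-weighted terms first.
topics = [([441832, 8563, 731824, 381507, 933925],
           [0.0062627265685400516, 0.005369477351664474,
            0.005309586577412947, 0.00503830331115649,
            0.004271026596928107])]

for term_indices, term_weights in topics:
    for idx, weight in zip(term_indices, term_weights):
        # Each index still needs to be resolved to a word somehow.
        print("term %d: %.6f" % (idx, weight))
```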

I suspect that [441832, 8563, 731824, 381507, 933925] in this output are word indices into Spark's vocabulary. If so, perhaps there is a way to find out which index points to which word.
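One way to build such a mapping, sketched below without Spark: if (as in pyspark.mllib's Python HashingTF, which exposes an `indexOf` method) each term's bucket is computed as `hash(term) % numFeatures`, then a reverse lookup can be built from your own vocabulary. The `index_of` and `build_reverse_lookup` helpers here are hypothetical stand-ins; with pyspark available you would use `HashingTF().indexOf(word)` instead so the buckets match what the model actually saw.

```python
# Sketch (assumption): pyspark.mllib's HashingTF maps a term to
# hash(term) % numFeatures, with numFeatures = 2**20 by default.
# There is no built-in inverse, so the lookup must come from your
# own vocabulary; hash collisions can map several words to one index.

NUM_FEATURES = 1 << 20  # 1048576, the default HashingTF dimension

def index_of(term, num_features=NUM_FEATURES):
    """Hypothetical mimic of HashingTF.indexOf: bucket index for a term."""
    return hash(term) % num_features

def build_reverse_lookup(vocabulary, num_features=NUM_FEATURES):
    """Map hash-bucket index -> list of words (collisions possible)."""
    lookup = {}
    for word in vocabulary:
        lookup.setdefault(index_of(word, num_features), []).append(word)
    return lookup

vocab = ["emmanuel", "lovely", "cat", "whole", "universe",
         "one", "sweetest", "aha", "haha"]
lookup = build_reverse_lookup(vocab)

# Any index produced from this vocabulary can now be resolved back
# to its candidate word(s):
for word in vocab:
    assert word in lookup[index_of(word)]
```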

So I ran a very simple test to check whether these numbers are word indices. The test follows the same procedure I used on the hundreds of posts, except that here I used only one post.

First, I tokenized the post and removed stopwords:

import string

# `stopwords` is assumed to be a set of stopwords loaded elsewhere,
# e.g. from NLTK's stopword corpus.
test_str = "Emmanuel is the most lovely cat in the whole universe. No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! Sweetest Emmanuel, haha!"

replace_punctuation = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
review_text = test_str.translate(replace_punctuation).split()
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]

print review_words

Output:

  ['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha', 'sweetest', 'emmanuel', 'haha']

Then I converted these words to tf-idf scores and normalized the scores:

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

def get_tfidf_features(txt_rdd):
    hashingTF = HashingTF()            # default dimension: 2**20 features
    tf = hashingTF.transform(txt_rdd)  # term-frequency vectors
    tf.cache()                         # IDF.fit makes a second pass over tf
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)

    return tfidf

nor = Normalizer(1)  # p=1, i.e. L1 normalization

review_words_rdd = sc.parallelize(review_words)
test_words_bag = get_tfidf_features(review_words_rdd)
nor_test_words_bag = nor.transform(test_words_bag)

Where I'm stuck: now, whether I run print test_words_bag.collect() or print nor_test_words_bag.collect(), to my surprise I get a list of SparseVectors instead of just one:


[SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {208501: 1.3218, 897504: 0.6286, 1045730: 2.0149}), SparseVector(1048576, {145380: 0.2231, 367721: 1.3218, 430838: 1.3218, 664173: 0.6286, 886510: 1.0986}), SparseVector(1048576, {60275: 2.0149, 134386: 1.3218, 145380: 0.4463, 282612: 0.9163, 356727: 1.3218, 441832: 2.0149, 812399: 0.7621}), SparseVector(1048576, {145380: 0.2231, 812399: 0.7621, 886510: 1.0986}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 1.3218, 897504: 1.2572}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 2.6435, 897504: 1.2572})]

I really don't understand why I get a list of SparseVectors when the input is only one string. Consequently, I don't know how to map these index-looking numbers back to the original words, so that I can see what the extracted topics actually look like.
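One plausible explanation (my reading, not confirmed by the output alone): sc.parallelize(review_words) creates an RDD with one record per word, and HashingTF.transform vectorizes each record as a separate document; since iterating a Python string yields its characters, every word gets hashed character by character. That would explain 14 SparseVectors for 14 tokens. A Spark-free sketch of the difference in document shape:

```python
# Sketch (assumption about the cause): HashingTF.transform hashes each
# *element* of a record. Parallelizing a flat word list gives one record
# per word, and iterating a string yields characters, so every word is
# vectorized character-by-character -- 14 vectors instead of 1.

review_words = ['emmanuel', 'lovely', 'cat', 'whole', 'universe',
                'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel',
                'aha', 'sweetest', 'emmanuel', 'haha']

# Shape of sc.parallelize(review_words): 14 records.
flat_records = list(review_words)
# Iterating one record ('emmanuel') yields characters, not words:
terms_seen = list(flat_records[0])

# Shape of sc.parallelize([review_words]): 1 record whose elements
# are whole words -- the document shape HashingTF expects.
nested_records = [review_words]

print(len(flat_records))   # 14
print(terms_seen)          # ['e', 'm', 'm', 'a', 'n', 'u', 'e', 'l']
print(len(nested_records)) # 1
```

Under this assumption, wrapping the token list as `sc.parallelize([review_words])` would yield a single document vector, but I have not verified this end-to-end.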

Do you know how to get human-readable topics out of Spark LDA modeling output?

0 Answers:

No answers yet.