I'm trying to learn Spark LDA modeling in Python, hoping to extract topics from several hundred Reddit posts.
However, after training the model and calling describeTopics(),
the result looks like this:
[([441832,8563,731824,381507,933925], [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947, 0.00503830331115649, 0.004271026596928107]),...
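As I understand it, each element of this list pairs one topic's term indices with their weights. A minimal sketch of unpacking it (values copied from the output above):

```python
# First topic from the describeTopics() output above: (term indices, term weights)
topics = [([441832, 8563, 731824, 381507, 933925],
           [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947,
            0.00503830331115649, 0.004271026596928107])]

for term_indices, term_weights in topics:
    # pair each index with its weight, highest weight first
    ranked = sorted(zip(term_indices, term_weights), key=lambda p: -p[1])
    print(ranked[0])  # top term index and its weight for this topic
```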
I'd like to know whether [441832, 8563, 731824, 381507, 933925]
in this output are word indices into Spark's vocabulary. If so, maybe I can find a way to map each index back to its word.
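If they are indices, one idea (an assumption on my part) is that pyspark.mllib's HashingTF computes an index as hash(term) % numFeatures, so a reverse lookup table could be built by hashing every word in my own vocabulary. A pure-Python sketch of that idea, with `index_of` standing in for HashingTF.indexOf:

```python
NUM_FEATURES = 1 << 20  # 1048576, the HashingTF default dimension

def index_of(term, num_features=NUM_FEATURES):
    # Stand-in for pyspark.mllib.feature.HashingTF.indexOf (assumption:
    # the Python implementation buckets terms by hash(term) % numFeatures).
    return hash(term) % num_features

# Build a reverse map from my own corpus vocabulary (hypothetical word list).
vocab = ['emmanuel', 'lovely', 'cat', 'whole', 'universe']
index_to_word = {index_of(w): w for w in vocab}

# Any index produced for these words can now be looked up again.
print(index_to_word[index_of('cat')])
```

Note that hashing is one-way and can collide, so two different words may end up sharing an index; the map only covers words I hashed myself.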
So I ran a very simple test to check whether these numbers are word indices. The test follows the same process I used on the hundreds of posts, except that here I use only one post.
First, I tokenized the post and removed stopwords:
import string  # needed for string.punctuation / string.maketrans

test_str = "Emmanuel is the most lovely cat in the whole universe. No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! Sweetest Emmanuel, haha!"
# replace every punctuation character with a space (Python 2 string.maketrans)
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
review_text = test_str.translate(replace_punctuation).split()
# lowercase and drop stopwords ("stopwords" is a set I defined elsewhere)
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]
print review_words
Output:
['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha', 'sweetest', 'emmanuel', 'haha']
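(For reference, the same cleanup under Python 3, where `string.maketrans` moved to `str.maketrans`; the small stopword set here is a stand-in for whatever list I actually use:)

```python
import string

test_str = ("Emmanuel is the most lovely cat in the whole universe. "
            "No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! "
            "Sweetest Emmanuel, haha!")

# Python 3: str.maketrans replaces the removed string.maketrans
replace_punctuation = str.maketrans(string.punctuation,
                                    ' ' * len(string.punctuation))

stopwords = {'is', 'the', 'most', 'in', 'no', 'more', 'than'}  # stand-in set
review_text = test_str.translate(replace_punctuation).split()
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]
print(review_words)
```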
Then I converted these words to tf-idf scores and normalized the scores:
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

def get_tfidf_features(txt_rdd):
    hashingTF = HashingTF()
    tf = hashingTF.transform(txt_rdd)
    tf.cache()
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    return tfidf

nor = Normalizer(1)  # L1 normalization
review_words_rdd = sc.parallelize(review_words)
test_words_bag = get_tfidf_features(review_words_rdd)
nor_test_words_bag = nor.transform(test_words_bag)
Where I got stuck:
Now, whether I run print test_words_bag.collect()
or print nor_test_words_bag.collect(),
to my surprise I get a list of SparseVectors instead of just one...
[SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}),
 SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}),
 SparseVector(1048576, {208501: 1.3218, 897504: 0.6286, 1045730: 2.0149}),
 SparseVector(1048576, {145380: 0.2231, 367721: 1.3218, 430838: 1.3218, 664173: 0.6286, 886510: 1.0986}),
 SparseVector(1048576, {60275: 2.0149, 134386: 1.3218, 145380: 0.4463, 282612: 0.9163, 356727: 1.3218, 441832: 2.0149, 812399: 0.7621}),
 SparseVector(1048576, {145380: 0.2231, 812399: 0.7621, 886510: 1.0986}),
 SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}),
 SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}),
 SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}),
 SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}),
 SparseVector(1048576, {367721: 1.3218, 897504: 1.2572}),
 SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}),
 SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}),
 SparseVector(1048576, {367721: 2.6435, 897504: 1.2572})]
I really don't understand why I get a list of SparseVectors when the input was only one string. As a result, I don't know how to turn these index-like numbers back into the original words, so that I can see what the extracted topics actually look like.
Do you know how to get human-readable topics out of the Spark LDA modeling output?
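While writing this up, I have a suspicion (unverified): `sc.parallelize(review_words)` creates an RDD with one record per word, and HashingTF.transform treats each record as a separate document, iterating each word string character by character, which would explain getting 14 vectors for 14 words. A pure-Python sketch of the difference, with a toy `hashing_tf` standing in for HashingTF:

```python
def hashing_tf(doc, num_features=1 << 20):
    # Toy stand-in for HashingTF.transform on ONE document: count terms,
    # bucketing each term by hash(term) % num_features.
    counts = {}
    for term in doc:
        i = hash(term) % num_features
        counts[i] = counts.get(i, 0) + 1
    return counts

review_words = ['emmanuel', 'lovely', 'cat']

# What I did: one record per WORD -> each string is iterated character by
# character, producing one vector per word (3 here, 14 in my actual test).
per_word = [hashing_tf(word) for word in review_words]

# The likely fix: wrap the token list so the RDD holds ONE document,
# i.e. sc.parallelize([review_words]) -> a single vector.
per_doc = [hashing_tf(doc) for doc in [review_words]]

print(len(per_word), len(per_doc))  # 3 vectors vs. 1 vector
```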