From TF-IDF to LDA clustering in Spark, pyspark

Asked: 2016-02-23 17:03:59

Tags: python apache-spark pyspark tf-idf lda

I am trying to cluster tweets stored in the format <key>,<listofwords>.

My first step is to extract TF-IDF values for the lists of words, using a DataFrame:
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)
# Define data frame schema
fields = [StructField('key', StringType(), False), StructField('content', StringType(), False)]
schema = StructType(fields)
# Data in format <key>,<listofwords>; split only on the first comma
file_temp = file.map(lambda l: l.split(",", 1))
file_df = sqlContext.createDataFrame(file_temp, schema)
# Extract TF-IDF, from https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)

Following the suggestion in Preparing data for LDA in spark, I tried to reformat this output into what I expect to be the input for LDA, based on this example. I started with:

from pyspark.ml.feature import StringIndexer

# Turn the string key into a numeric index usable as a document ID
indexer = StringIndexer(inputCol='key', outputCol='KeyIndex')
indexed_data = (indexer.fit(rescaled_data).transform(rescaled_data)
                .drop('key').drop('content').drop('words').drop('rawFeatures'))

But now I cannot find a good way to convert my DataFrame into the format proposed in the previous example, or in this example.

I would be very grateful if someone could point me in the right direction, or correct my approach if it is wrong.

Extracting TF-IDF vectors from a set of documents and clustering them seems like it should be a fairly classic task, but I cannot find a simple way to do it.

2 answers:

Answer 0 (score: 4)

LDA expects (id, features) pairs as input, so, assuming KeyIndex serves as the ID:

from pyspark.mllib.clustering import LDA
from pyspark.sql.functions import col

k = ...  # number of topics/clusters
# Build an RDD of [id, features] lists, the input format LDA.train expects
corpus = indexed_data.select(col("KeyIndex").cast("long"), "features").map(list)
model = LDA.train(corpus, k=k)
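
To inspect the resulting topics, a minimal sketch follows; it assumes model.describeTopics is available, which was added to the Python mllib API in Spark 1.6 (earlier versions only expose topicsMatrix()):

# Sketch: top terms per topic (assumes Spark >= 1.6 for the Python describeTopics)
for topic_id, (term_indices, term_weights) in enumerate(model.describeTopics(maxTermsPerTopic=5)):
    # term_indices refer to HashingTF buckets, so they cannot be mapped back to words
    print(topic_id, list(zip(term_indices, term_weights)))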

Answer 1 (score: -1)

LDA does not take a TF-IDF matrix as input. Instead, it only needs a TF (term frequency) matrix. For example:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.clustering import LDA

tokenizer = Tokenizer(inputCol="hashTagDocument", outputCol="words")

# stopwords is a user-supplied list of stop words to filter out
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=stopwords)

# CountVectorizer produces raw term counts (TF), which is what LDA expects
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=40000, minDF=5)

# Define the LDA stage (k and maxIter are placeholder values, adjust as needed)
lda = LDA(k=10, maxIter=20, featuresCol="features")

# corpus is a DataFrame with a 'hashTagDocument' string column
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(corpus)

pipelineModel.stages
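
As a usage sketch (names and parameters as assumed above, Spark 2.0+ ml API), the fitted LDA stage can be pulled out of the pipeline to inspect topics and score documents:

# Sketch: the fitted LDA model is the last stage of the fitted pipeline
ldaModel = pipelineModel.stages[-1]

# Top 5 terms per topic, as indices into the CountVectorizer vocabulary
ldaModel.describeTopics(5).show(truncate=False)

# topicDistribution gives each document's mixture over the k topics
pipelineModel.transform(corpus).select("topicDistribution").show(truncate=False)

Unlike HashingTF, CountVectorizer keeps its vocabulary, so the term indices can be mapped back to words via pipelineModel.stages[2].vocabulary.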