Generating vectors from text data for KMeans with Spark

Asked: 2017-04-14 10:44:44

Tags: apache-spark machine-learning

I am new to Spark and machine learning. I am trying to cluster some data with KMeans, such as:

1::Hi How are you
2::I am fine, how about you

In this data the delimiter is ::, and the text to cluster is the second column. After reading the official Spark documentation and a lot of articles, I still cannot generate the vectors required as input to the KMeans.train step.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local", "test")

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Keep only the second column (the actual text), splitting on the :: delimiter
val rawData = sc.textFile("data/mllib/KM.txt").map(line => line.split("::")(1))

val sentenceData = rawData.toDF("sentence")

// Split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

val wordsData = tokenizer.transform(sentenceData)

// Hash the words into a fixed-size term-frequency vector column
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)

val clusters = KMeans.train(featurizedData, 2, 10)

I get the following error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val clusters = KMeans.train(featurizedData, 2, 10)
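
The mismatch comes from mixing the two APIs: HashingTF here is the DataFrame-based spark.ml transformer, while KMeans.train belongs to the RDD-based spark.mllib package. As a rough sketch (assuming Spark 2.x, where Vectors.fromML is available), one way to bridge them is to pull the feature column out of the DataFrame as an RDD:

// Sketch, assuming Spark 2.x: convert the spark.ml vector column into the
// RDD[org.apache.spark.mllib.linalg.Vector] that the old KMeans.train expects
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val trainingData = featurizedData.select("rawFeatures").rdd.map {
  case Row(v: org.apache.spark.ml.linalg.Vector) => OldVectors.fromML(v)
}

val clusters = KMeans.train(trainingData, 2, 10) // now type-checks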

Please suggest how to prepare the input data for KMeans.

Thanks in advance.

1 answer:

Answer 0 (score: 1)

I finally got it working after replacing the following code:

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)

val clusters = KMeans.train(featurizedData, 2, 10)

with:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

// The spark.ml KMeans estimator consumes a DataFrame vector column directly,
// so all three steps can be chained in a Pipeline
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")

val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
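
For completeness, a minimal sketch of running the assembled pipeline, assuming the sentenceData DataFrame built in the question:

// Fit all three stages (tokenizer -> hashingTF -> kmeans) in one pass,
// then read back the cluster assignment for each sentence
val model = pipeline.fit(sentenceData)

val predictions = model.transform(sentenceData)
predictions.select("sentence", "prediction").show(false)

The fitted KMeansModel is the last pipeline stage, so model.stages.last can be cast to org.apache.spark.ml.clustering.KMeansModel if the cluster centers are needed.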