Unable to run LDA on a Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

Time: 2016-08-06 17:28:11

Tags: scala apache-spark apache-spark-mllib

I am following this tutorial video on an LDA example, and I ran into the following error:

<console>:37: error: overloaded method value run with alternatives:
  (documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
  (documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
  cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
     val model = run(lda_countVector)
                                   ^

So I am trying to convert this DataFrame to an RDD, but the result is always typed as a Dataset. Could someone take a look at this problem?

// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]

2 answers:

Answer 0 (score: 4)

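The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map now returns a Dataset instead of an RDD, so its result is not compatible with the old RDD-based MLlib API. You should convert the data to an RDD first, like this: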

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
// LDA expects document IDs to be unique and non-negative
val df = Seq((1L, a), (2L, b), (3L, a)).toDF

val ldaDF = df.rdd.map { 
  case Row(id: Long, countVector: Vector) => (id, countVector) 
} 

val model = new LDA().setK(3).run(ldaDF)

Alternatively, you can convert to a typed Dataset first and then to an RDD:

val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)
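The snippets above rely on what spark-shell provides implicitly (a `spark` session and `spark.implicits._`). Outside the shell, the same Dataset-to-RDD conversion might look like the sketch below. The object name `LdaOnRddSketch`, the method name `train`, and the toy vectors are placeholders for illustration, not part of the original answer.

```scala
import org.apache.spark.mllib.clustering.{LDA, LDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object and method names, used only for this sketch.
object LdaOnRddSketch {

  def train(spark: SparkSession): LDAModel = {
    import spark.implicits._ // provides toDF and the encoders needed by as[...]

    // Toy corpus: (document id, term-count vector); ids are unique and non-negative.
    val df = Seq(
      (1L, Vectors.dense(1.0, 2.0, 3.0)),
      (2L, Vectors.dense(3.0, 4.0, 5.0))
    ).toDF()

    // Dataset -> RDD: the RDD-based LDA.run overload accepts RDD[(Long, Vector)],
    // not Dataset[(Long, Vector)].
    val corpus: RDD[(Long, Vector)] = df.as[(Long, Vector)].rdd

    new LDA().setK(3).run(corpus)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .appName("lda-rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    val model = train(spark)
    println(s"k = ${model.k}, vocabSize = ${model.vocabSize}")

    spark.stop()
  }
}
```

The explicit `RDD[(Long, Vector)]` annotation is optional, but it makes the compiler report a type mismatch at the conversion step rather than at the `run` call.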

Answer 1 (score: 0)

I am following the same example and got this error. Any suggestions?

scala> lda_countVector.take(1)
15/06/20 15:44:53 ERROR TaskSetManager: Task 0 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 16, brdn6232.target.com, executor 1): scala.MatchError: [0,(6139,[0,1,147,231,315,496,497,527,569,604,776,835,848,858,942,1144,1687,1980,2051,2455,2756,3060,3465,3660,4506,5434,5599],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]]
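One possible cause of a scala.MatchError like this, assuming `countVectors` was produced by the spark.ml CountVectorizer on Spark 2.x: the column then holds org.apache.spark.ml.linalg.Vector values, which do not match a pattern typed against org.apache.spark.mllib.linalg.Vector. A sketch of a conversion under that assumption follows. The helper name `toLdaCorpus` and the sample data are hypothetical, and the `id: Long` pattern must also agree with the actual type of the id column.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical object and helper names, used only for this sketch.
object MlToMllibSketch {

  // Assumes a two-column DataFrame: (id: Long, vector: spark.ml Vector).
  def toLdaCorpus(countVectors: DataFrame): RDD[(Long, MLlibVector)] =
    countVectors.rdd.map {
      // Match the spark.ml vector type, then convert it to the mllib type
      // expected by the RDD-based LDA API.
      case Row(id: Long, v: MLVector) => (id, MLlibVectors.fromML(v))
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .appName("ml-to-mllib-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for CountVectorizer output: spark.ml sparse vectors.
    val countVectors = Seq(
      (0L, MLVectors.sparse(4, Array(0, 2), Array(1.0, 2.0))),
      (1L, MLVectors.sparse(4, Array(1, 3), Array(1.0, 1.0)))
    ).toDF("id", "features")

    toLdaCorpus(countVectors).take(1).foreach(println)

    spark.stop()
  }
}
```

If the id column is an Int rather than a Long, the same MatchError would appear, so checking `countVectors.printSchema()` first is a reasonable step.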