I am following this tutorial video about an LDA example, and I ran into the following problem:
<console>:37: error: overloaded method value run with alternatives:
(documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
(documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
val model = run(lda_countVector)
^
So I want to convert this DataFrame to an RDD, but it always comes out typed as a Dataset. Could anyone please take a look at this problem?
// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]
Answer 0 (score: 4)
The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map returns a Dataset rather than an RDD, so the result is not compatible with the old RDD-based MLlib API. You should first convert your data to an RDD, as follows:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.Row

// Build a toy DataFrame of (document id, term-count vector) pairs
val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L, a), (2L, b), (2L, a)).toDF

// Drop to the underlying RDD[Row] and pattern-match each Row back into
// the (Long, Vector) tuple that the RDD-based LDA API expects
val ldaDF = df.rdd.map {
  case Row(id: Long, countVector: Vector) => (id, countVector)
}

val model = new LDA().setK(3).run(ldaDF)
Alternatively, you can convert to a typed Dataset first and then to an RDD:
val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)
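Note that both toDF and the .as[(Long, Vector)] conversion rely on the implicit encoders that spark-shell imports automatically; in a standalone application you would need import spark.implicits._ first.
If you want to sanity-check the trained model, the RDD-based LDAModel exposes describeTopics; a minimal sketch, continuing from the toy df and model above:
// Print the top 3 terms of each topic; describeTopics returns, per topic,
// arrays of term indices and weights sorted by descending weight
model.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
  case ((terms, weights), topic) =>
    val summary = terms.zip(weights).map { case (t, w) => f"term $t -> $w%.3f" }.mkString(", ")
    println(s"Topic $topic: $summary")
}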
Answer 1 (score: 0)
I am following the same example and got this error. Any suggestions?
scala> lda_countVector.take(1)
15/06/20 15:44:53 ERROR TaskSetManager: Task 0 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 16, brdn6232.target.com, executor 1): scala.MatchError: [0,(6139,[0,1,147,231,315,496,497,527,569,604,776,835,848,858,942,1144,1687,1980,2051,2455,2756,3060,3465,3660,4506,5434,5599],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
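A likely cause of this MatchError: the Row holds a sparse vector of type org.apache.spark.ml.linalg.Vector (produced by the ml-package CountVectorizer), while the pattern match expects the mllib Vector type, so no case matches. A sketch of a possible fix, assuming countVectors is such a DataFrame:
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

// Match on the ml-package Vector actually stored in the Row, then convert
// it to the mllib type that the RDD-based LDA.run expects
val ldaRDD = countVectors.rdd.map {
  case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector))
}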