Accessing a MapPartitionsRDD

Date: 2016-02-03 13:57:47

Tags: scala apache-spark

I am trying to extract predictions from a MatrixFactorizationModel by mapping an RDD of users over the model's recommendProducts method. This gives me a MapPartitionsRDD. Attempting to reduce or otherwise access this RDD gives me a SparkException.

Here is the simplified code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}

val users = sc.parallelize(List(1,2))
val trainingData = sc.parallelize(List(Rating(1,1,0.5),Rating(1,2,0.5),Rating(2,1,1),Rating(2,3,1))).cache()

val model = ALS.trainImplicit(trainingData, 6, 20, 0.1, 2)

val recommendations = users.map(model.recommendProducts(_,2))

recommendations.first

The error occurs on the last line:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 11500.0 failed 1 times, most recent failure: Lost task 2.0 in stage 11500.0 (TID 6401, localhost): org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:928)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:168)

My only theory is that a MapPartitionsRDD does not actually apply its function at creation time, so if the model's recommendProducts method performs some RDD operation internally, perhaps that operation is only invoked when the data is accessed, and we end up with an attempted nested RDD call. If that is the case, does it mean there is no way to run any operation on a MatrixFactorizationModel in parallel?
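The stack trace supports this theory: it shows recommendProducts calling PairRDDFunctions.lookup, which is an RDD action. A rough paraphrase (not the verbatim source, just a sketch of the shape suggested by the trace) of what the method does internally:

```scala
// Sketch of MatrixFactorizationModel.recommendProducts, paraphrased from
// the stack trace above. userFeatures is an RDD[(Int, Array[Double])]
// held inside the model, and lookup() is an RDD *action*, so calling
// this method from inside users.map(...) nests an action within a
// transformation, which is exactly what SPARK-5063 forbids.
def recommendProducts(user: Int, num: Int): Array[Rating] = {
  val userVector = userFeatures.lookup(user).head // RDD action: driver-only
  // ...score userVector against productFeatures and return the top num...
}
```

So the failure is not about laziness per se: the closure shipped to the executors captures the model, and the first task that invokes recommendProducts triggers an RDD action from a non-driver context.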

1 Answer:

Answer 0 (score: 1)

As I suspected, looking at the source of MatrixFactorizationModel, I can see that it stores the user and product features internally as RDDs. Any call into this model must therefore be made from the driver. To get my code running, I had to collect my users to the driver and use the plain, non-RDD version of map:

val recommendations = users.collect.toList.map(model.recommendProducts(_,2))

recommendations.head
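If the recommendations do need to be computed in parallel, Spark 1.4+ adds recommendProductsForUsers to MatrixFactorizationModel. It returns an RDD, so it is invoked once from the driver and the work is distributed; the join below to restrict it to specific users is my own addition, not part of the original question:

```scala
// Recommend the top 2 products for every user known to the model.
// This is itself a distributed computation, so unlike recommendProducts
// it is safe to call once from the driver and scales with the cluster.
val allRecommendations: RDD[(Int, Array[Rating])] =
  model.recommendProductsForUsers(2)

// Keep only the users we care about by joining against the users RDD.
val wanted = users.map(u => (u, ()))
val recommendations = allRecommendations.join(wanted).mapValues(_._1)

recommendations.first
```

This avoids collecting anything to the driver, at the cost of computing recommendations for all users before filtering.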