How to resolve nested RDDs in Spark

Date: 2017-03-07 05:03:07

Tags: scala apache-spark recommendation-engine

// As requested, I updated my code for the value 'combinations'.
// 'list' is an RDD of type org.apache.spark.rdd.RDD[(Int, Array[Int])].
// For each key, build every unordered pair of its items, then keep only the pairs.
val combinations = list.mapValues(_.toSeq.combinations(2).toArray.map { case Seq(x, y) => (x, y) }).map(_._2)

// Sample contents of 'combinations' (an RDD[Array[(Int, Int)]]):
// Array(Array((1953,1307), (1953,527), (1953,1272), (1953,1387), (1953,318)), Array(( ...))...)
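For context, a quick check of what combinations(2) produces on a toy input (hypothetical values, not the real data):

// Each value sequence becomes every unordered pair of its elements:
Seq(1953, 1307, 527).combinations(2).toList
// => List(List(1953, 1307), List(1953, 527), List(1307, 527))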


import org.jblas.DoubleMatrix

// Compute the cosine similarity for every item pair. The lookups below run
// inside the enclosing transformation, i.e. an RDD operation is nested
// inside another RDD operation -- this is what fails.
val simOnly = combinations.foreach { x =>
    x.map { case (item_1, item_2) =>
        // productFeatures is itself an RDD, so lookup() here is a nested RDD call
        val itemFactor_1 = modelMLlib.productFeatures.lookup(item_1).head
        val itemFactor_2 = modelMLlib.productFeatures.lookup(item_2).head
        val itemVector_1 = new DoubleMatrix(itemFactor_1)
        val itemVector_2 = new DoubleMatrix(itemFactor_2)
        val sim = cosineSimilarity(itemVector_1, itemVector_2)
        sim
    }
}

This is my code for computing the cosine similarity between items.

But nested RDDs are not supported in Apache Spark.

How can I solve this problem correctly?

And my interpreter shows this:

org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
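
A common workaround for SPARK-5063 in this situation is to pull the product feature vectors out of the RDD once on the driver and broadcast them, so the per-pair lookup becomes a plain map access instead of a nested RDD call. Below is a minimal sketch, assuming modelMLlib is an MLlib MatrixFactorizationModel whose factor table fits in driver memory and that sc is the SparkContext; the cosineSimilarity helper is a hypothetical stand-in for the one used in the question:

import org.jblas.DoubleMatrix

// Hypothetical helper matching the signature used in the question.
def cosineSimilarity(a: DoubleMatrix, b: DoubleMatrix): Double =
  a.dot(b) / (a.norm2() * b.norm2())

// Materialize the item factors on the driver once...
val itemFactors: Map[Int, Array[Double]] = modelMLlib.productFeatures.collect().toMap

// ...and ship them to every executor as a read-only broadcast variable.
val itemFactorsBC = sc.broadcast(itemFactors)

// The per-pair lookup is now a Map access, so no RDD operation is nested
// inside another transformation.
val simOnly = combinations.map { pairs =>
  pairs.map { case (item_1, item_2) =>
    val itemVector_1 = new DoubleMatrix(itemFactorsBC.value(item_1))
    val itemVector_2 = new DoubleMatrix(itemFactorsBC.value(item_2))
    cosineSimilarity(itemVector_1, itemVector_2)
  }
}

Note that map is used instead of foreach so that simOnly is an RDD of similarity arrays rather than Unit. If the factor matrix is too large to broadcast, a join between an RDD of item pairs and productFeatures keeps everything distributed, at the cost of extra shuffles.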
