For loop over the lists of two RDDs to get matching keys

Date: 2018-10-14 01:29:30

Tags: scala apache-spark rdd

I have two rdds of the form RDD[(String, Array[(String, Array[String])])]. The data in them looks like:

rdd1 = (4, [(0, [1,4,5,6]), (2, [4,5,6])])

(5, [(0, [1,4,5,6]), (2, [4,5,6])]) ...

rdd2 = (4, [(0, [1,4,6])])

(5, [(1, [2,5,6]), (2, [3,5])]) ...
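For concreteness, here is a minimal sketch of how this sample data might be built as Scala RDDs (the SparkContext `sc` is assumed to exist, and all keys and elements are Strings per the RDD type above):

    // Hypothetical construction of the sample data above,
    // assuming an existing SparkContext `sc`.
    val rdd1 = sc.parallelize(Seq(
      ("4", Array(("0", Array("1", "4", "5", "6")), ("2", Array("4", "5", "6")))),
      ("5", Array(("0", Array("1", "4", "5", "6")), ("2", Array("4", "5", "6"))))
    ))
    val rdd2 = sc.parallelize(Seq(
      ("4", Array(("0", Array("1", "4", "6")))),
      ("5", Array(("1", Array("2", "5", "6")), ("2", Array("3", "5"))))
    ))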

First, I want to check whether the keys of rdd1 also exist in rdd2. If they do, then for the tuples in their arrays I want to run a for loop that pairs every tuple of rdd1 with every tuple of rdd2 for that key. For example, both rdd1 and rdd2 contain the key 4, so for key 4 I want a for loop whose items are the pairs (0, [1,4,5,6]) with (0, [1,4,6]) and (2, [4,5,6]) with (0, [1,4,6]). Iterating over these pairs, I have to perform some operation on them.

What I tried is to merge the two rdds together and then apply a for loop, but that also iterates over tuple pairs coming from the same rdd.

val rdd3 = merged_both_rdd1_rdd2_by_key.flatMap(x => { for (i <- 0 until x._2.size) { for (j <- i until x._2.size) { /* compare x._2(i) with x._2(j) */ } } })

However, this also iterates over tuples of the same rdd. I only want to iterate every tuple of rdd1 against every tuple of rdd2.

I tried nesting loops over the two rdds, but that gives me an error.

    val sortedLines2 = sortedLines1.flatMap(y => {
      var myMap: Map[(String, String), Double] = Map()
      // Referencing sortedLines12 here nests one RDD's transformation
      // inside another's, which Spark does not allow (see error below).
      val second = sortedLines12.flatMap(x => {
        var myMap1: Map[(String, String), Double] = Map()
        for (i <- 0 until x._2.size) {
          for (j <- 0 until y._2.size) {
            if (i != j) {
              val inter = (x._2(i)._2.toSet & y._2(j)._2.toSet).size.toDouble
              val union = (x._2(i)._2.toSet.size + y._2(j)._2.toSet.size).toDouble - inter
              val div = inter / union
              if (div >= threshold) {
                if (!myMap.contains((x._2(i)._1, y._2(j)._1))) {
                  myMap += ((x._2(i)._1, y._2(j)._1) -> div)
                  myMap1 += ((x._2(i)._1, x._2(j)._1) -> div)
                }
              }
            }
          }
        }
        myMap1
      })
      myMap
    })

Doing this, I get the following error:

    This RDD lacks a SparkContext. It could happen in the following cases:
    (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
    (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.

1 Answer:

Answer 0 (score: 1):

You can first try joining the rdds by key:

val rddsJoin = rdd1.join(rdd2)

Then loop over the values of the joined rdd:

rddsJoin.foreach { case (key, (v1, v2)) => for (vE1 <- v1; vE2 <- v2) { doSomething(vE1, vE2) } }

If you want a transformation (rather than an action), replace foreach with map, according to your application's needs.
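Putting the join together with the intersection-over-union computation from the question's own code, a minimal sketch might look like this (the threshold value of 0.5 is an assumption; rdd1 and rdd2 are as described in the question):

    // Sketch: join by key, then pair every tuple of rdd1's value array
    // with every tuple of rdd2's value array for that key, keeping the
    // pairs whose similarity meets the threshold.
    val threshold = 0.5 // assumed; set according to your application
    val similarities = rdd1.join(rdd2).flatMap { case (_, (v1, v2)) =>
      for {
        (id1, arr1) <- v1
        (id2, arr2) <- v2
        inter = (arr1.toSet & arr2.toSet).size.toDouble
        union = arr1.toSet.size + arr2.toSet.size - inter
        div = inter / union
        if div >= threshold
      } yield ((id1, id2), div)
    }

Because all the work happens inside a single transformation on the joined rdd, no RDD is referenced from inside another RDD's closure, which is what triggered the SPARK-5063 error above.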