I have two RDDs of type RDD[(String, Array[(String, Array[String])])]. They contain data like this:
rdd1 = (4, [(0, [1,4,5,6]), (2, [4,5,6])])
       (5, [(0, [1,4,5,6]), (2, [4,5,6])]) ......
rdd2 = (4, [(0, [1,4,6])])
       (5, [(1, [2,5,6]), (2, [3,5])]) ......
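First, I want to check whether each key of rdd1 also exists in rdd2. Then, for a key present in both, I want to run a for loop that pairs every tuple in rdd1's array with every tuple in rdd2's array for that key. For example, both rdd1 and rdd2 contain key 4, so for key 4 the loop should visit the pairs (0, [1,4,5,6]) (0, [1,4,6]) and (2, [4,5,6]) (0, [1,4,6]). I need to perform some operations on each pair while iterating.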
What I tried was to combine the two RDDs by key and then apply a for loop, but that also iterates over pairs of tuples from the same RDD.
val rdd3 = merged_both_rdd1_rdd2_by_key.flatMap(x=> {for(i <- 0 until x._2.size) {for(j <- i until x._2.size)} })
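However, this also pairs tuples that come from the same RDD. I only want to pair each tuple of rdd1 with each tuple of rdd2.
I also tried nesting the loops over the two RDDs, but that gives me an error.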
val sortedLines2 = sortedLines1.flatMap(y => {
  var myMap: Map[(String, String), Double] = Map()
  val second = sortedLines12.flatMap(x => {
    var myMap1: Map[(String, String), Double] = Map()
    for (i <- 0 until x._2.size) {
      for (j <- 0 until y._2.size) {
        if (i != j) {
          val inter = (x._2(i)._2.toSet & y._2(j)._2.toSet).size.toDouble
          val union = (x._2(i)._2.toSet.size + y._2(j)._2.toSet.size).toDouble - inter
          val div = inter / union
          if (div >= threshold) {
            if (!myMap.contains((x._2(i)._1, y._2(j)._1))) {
              myMap += ((x._2(i)._1, y._2(j)._1) -> div)
              myMap1 += ((x._2(i)._1, x._2(j)._1) -> div)
            }
          }
        }
      }
    }
    myMap1
  })
  myMap
})
Doing this, I get the following error:
This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Answer 0 (score: 1)
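You can first join the RDDs by key:
val rddsJoin = rdd1.join(rdd2)
and then loop over the values of the joined RDD:
rddsJoin.foreach{case(key,(v1,v2)) =>
{for(vE1<-v1;vE2<-v2) {doSomething(vE1,vE2)}}}
If you want a transformation (rather than an action), replace foreach with map or flatMap, depending on your application's needs.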
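To make the join approach concrete, here is a minimal sketch that uses the sample data from the question. It also avoids the "This RDD lacks a SparkContext" error, because no RDD is referenced inside another RDD's transformation: the join brings both arrays for a key into the same record first. The names CrossPairsSketch, jaccard and similarPairs, the threshold value 0.5, and the local master setting are assumptions made for illustration and are not from the original post. The i != j check from the question is left out here, since it compares positions in two different arrays.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CrossPairsSketch {
  // Shape of the value side of each RDD in the question.
  type Group = Array[(String, Array[String])]

  // Jaccard similarity: |A intersect B| / |A union B| (hypothetical helper).
  def jaccard(a: Array[String], b: Array[String]): Double = {
    val sa = a.toSet
    val sb = b.toSet
    val unionSize = (sa | sb).size
    if (unionSize == 0) 0.0 else (sa & sb).size.toDouble / unionSize
  }

  // Pairs every tuple from the rdd1 side with every tuple from the rdd2 side
  // for one key, keeping only pairs at or above the threshold.
  def similarPairs(left: Group, right: Group,
                   threshold: Double): Seq[((String, String), Double)] =
    for {
      (id1, items1) <- left.toSeq
      (id2, items2) <- right.toSeq
      sim = jaccard(items1, items2)
      if sim >= threshold
    } yield ((id1, id2), sim)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run on one machine.
    val conf = new SparkConf().setMaster("local[*]").setAppName("cross-pairs-sketch")
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Seq(
      ("4", Array(("0", Array("1", "4", "5", "6")), ("2", Array("4", "5", "6")))),
      ("5", Array(("0", Array("1", "4", "5", "6")), ("2", Array("4", "5", "6"))))
    ))
    val rdd2 = sc.parallelize(Seq(
      ("4", Array(("0", Array("1", "4", "6")))),
      ("5", Array(("1", Array("2", "5", "6")), ("2", Array("3", "5"))))
    ))

    val threshold = 0.5 // assumed value; the question does not state it

    // Inner join keeps only keys present in both RDDs; flatMap then emits
    // one record per qualifying (rdd1 tuple, rdd2 tuple) pair.
    val result = rdd1.join(rdd2).flatMap { case (key, (v1, v2)) =>
      similarPairs(v1, v2, threshold).map { case (ids, sim) => (key, ids, sim) }
    }

    result.collect().foreach(println)
    sc.stop()
  }
}
```

Because join is an inner join, keys that appear in only one of the two RDDs are dropped, which covers the "check whether the key also exists in rdd2" step. Keeping the pairing logic in a plain function also means it can be tested without starting Spark.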