Question

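I have a key-value RDD, and I need to join several sets of keys against it.

The key-value RDD is large (100 GB); the key sets are relatively small (but not small enough to broadcast).

I assign the same partitioner to all the RDDs and call join.

Expected behavior: after repartitioning, all the data to be joined is co-located, so the join should be fast enough. If the keys RDD is small, the join should be quick.

Actual behavior: even when the keys RDD is small or empty, the join takes a long time (about 10 minutes).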

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// declare a common partitioner for all RDDs
val partitioner = new HashPartitioner(500)

// declare the key-value RDD, partitioned once and persisted
val storage: RDD[(K, V)] = {
  val storage0: RDD[(K, V)] = ???
  storage0.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
}
storage.count() // run an action so the persisted RDD is materialized

// join several key RDDs with the storage
(1 to 1000).foreach { i =>
  val keys: RDD[K] = ???
  val partitionedKeys = keys.map(k => k -> ()).partitionBy(partitioner)
  // join the keys RDD with the storage and do something with the result
  partitionedKeys.join(storage).foreachPartition { iter =>
    ???
  }
}
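
To make the co-location assumption behind the code above concrete, here is a minimal pure-Scala sketch (no Spark required) of how a hash partitioner maps a key to a partition: the key's hashCode is reduced to a non-negative modulus of the partition count. The object name CoPartitionSketch and its helper names are hypothetical, chosen only for this illustration; the sketch mirrors the idea rather than reproducing Spark's source.

```scala
// Hypothetical illustration of hash partitioning; not Spark code.
object CoPartitionSketch {
  // Reduce x into the range [0, mod), even when x is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key: null goes to partition 0,
  // everything else is placed by its hashCode.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val numPartitions = 500 // same count as the partitioner in the question
    val lookupKeys = Seq("b", "d")
    // The same key always maps to the same partition index, which is why
    // two RDDs sharing one partitioner keep matching keys together.
    lookupKeys.foreach { k =>
      println(s"$k -> partition ${partitionOf(k, numPartitions)}")
    }
  }
}
```

Because the partition index depends only on the key and the partition count, a record from the keys RDD and a record from the storage RDD with the same key end up at the same partition index, which is the property the expected behavior relies on.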

What am I doing wrong here?

Spark: join with a cached RDD takes too long

0 answers: