Why is the number of partitions different after a join in Spark Streaming?

Time: 2017-07-19 18:11:44

Tags: apache-spark spark-streaming

import java.util.ArrayList

import kafka.serializer.StringDecoder
import org.apache.avro.generic.GenericData
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
val streamingContext = new StreamingContext(sparkConf, Minutes(1))

// History RDDs carried across batches; reassigned inside the foreachRDD blocks below
var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD

var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD


// Two direct Kafka streams (kafkaParams, the topic names and GenericDataRecordDecoder,
// a custom Avro decoder, are defined elsewhere)
val dstream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_1))
val stream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_2))

val dstream_2 = stream_2.map((r: (String, GenericData.Record)) => {
  // some mapping
})

// Union each batch with the accumulated history, then full-outer-join the two sides
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)

// Keep keys that matched only on the right side (no entry in historyDStream)
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)

filtered.foreachRDD { rdd =>
  val formatted = rdd.map(r => (r._1, r._2._2.get))

  historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
  historyRdd_2 = formatted      // assign the new history
  historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}


// Keep keys that matched only on the left side (no entry in historyDStream_2)
val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)

filteredStream.foreachRDD { rdd =>
  val formatted = rdd.map(r => (r._1, r._2._1.get))

  historyRdd.unpersist(false)   // unpersist the 'old' history RDD
  historyRdd = formatted        // assign the new history
  historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
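
To see where the partition count changes between stages, the per-batch counts can be printed for each DStream (a small sketch for inspection only, not part of the job above; partitions.length is available on any RDD):

historyDStream.foreachRDD(rdd => println(s"left side partitions:  ${rdd.partitions.length}"))
historyDStream_2.foreachRDD(rdd => println(s"right side partitions: ${rdd.partitions.length}"))
fullJoinResult.foreachRDD(rdd => println(s"join partitions:       ${rdd.partitions.length}"))

The closure passed to foreachRDD runs on the driver, so these println lines show up in the driver log once per batch.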

DStream_1 and DStream_2 each have 128 partitions, but after performing the join the resulting DStream has 3 partitions, and I am not doing any repartitioning anywhere. My understanding was that if the two DStreams have the same number of partitions, the joined DStream ends up with that same number of partitions, because the join happens partition-to-partition. Please correct me if I am wrong about this.
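For what it's worth, I know the join's output partitioning can be pinned explicitly instead of being left to Spark's default partitioner (which, if I understand correctly, can fall back to spark.default.parallelism when neither side carries a partitioner). A minimal sketch of that workaround, using the overload of fullOuterJoin that takes a Partitioner; the 128 here is just the upstream partition count and the val name is mine:

import org.apache.spark.HashPartitioner

// Sketch: force the join output to a fixed number of partitions
val pinnedJoin = historyDStream.fullOuterJoin(historyDStream_2, new HashPartitioner(128))
// equivalent shorthand: historyDStream.fullOuterJoin(historyDStream_2, 128)

But my question is about the default behaviour: why 3 partitions and not 128?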

0 Answers:

No answers yet