val sparkConf = new SparkConf().setMaster("yarn-cluster")
.setAppName("SparkJob")
.set("spark.executor.memory","2G")
.set("spark.dynamicAllocation.executorIdleTimeout","5")
val streamingContext = new StreamingContext(sparkConf, Minutes(1))
var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
val stream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams , Set(inputTopic_1))
val dstream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams , Set(inputTopic_2))
val dstream_2 = stream_2.map((r: Tuple2[String, GenericData.Record]) =>
{
//some mapping
}
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
dstream_2.foreachRDD(r => r.repartition(500))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)
filtered.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._2.get))
historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
historyRdd_2 = formatted // assign the new history
historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)
filteredStream.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._1.get))
historyRdd.unpersist(false) // unpersist the 'old' history RDD
historyRdd = formatted // assign the new history
historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
}
}
这里我的stream_1和dstream_2有128个分区但是当我进行连接时,分区会减少到3个分区,为什么会这样。据我所知,连接是分区完成的,即分区1将与另一个Rdd的分区1连接。所有过滤的RDD都有3个分区,这就是historyRDD和HistoryRDD2有3个分区的原因。
答案 0 :(得分:0)
spark中的分区取决于您执行的操作。例如,groupByKey()
保留其父RDD的分区号(如果有的话)。相反,某些操作(如join()
)总结了RDD1和RDD2 的分区数。假设RDD1有2个分区且RDD2有3个分区,则join()的结果为5.