I have a question: even though there are executors available, Spark assigns two different partitions to the same executor when a custom partitioner is used. The partition structure with and without the custom partitioner appears identical, yet for some reason the task assignment differs. Here is a simplified version of the problem at hand:
Default partitioner:
Partitions Structure:
[[(0,null)], [(1,null)], [(2,null)], [(3,null)], [(4,null)], [(5,null)]]
DAGScheduler:54 - Submitting 6 missing tasks from ResultStage 1
(MapPartitionsRDD[4] at map at xxx.java:77)
TaskSchedulerImpl:54 - Adding task set 1.0 with 6 tasks
Starting task 0.0 in stage 1.0 (TID 6, xxx.xx.xx.4, executor 3, partition 0, PROCESS_LOCAL, 7882 bytes)
Starting task 1.0 in stage 1.0 (TID 7, xxx.xx.xx.14, executor 4, partition 1, PROCESS_LOCAL, 7877 bytes)
Starting task 2.0 in stage 1.0 (TID 8, xxx.xx.xx.27, executor 5, partition 2, PROCESS_LOCAL, 7882 bytes)
Starting task 3.0 in stage 1.0 (TID 9, xxx.xx.xx.3, executor 1, partition 3, PROCESS_LOCAL, 7882 bytes)
Starting task 4.0 in stage 1.0 (TID 10, xxx.xx.xx.26, executor 2, partition 4, PROCESS_LOCAL, 7882 bytes)
Starting task 5.0 in stage 1.0 (TID 11, xxx.xx.xx.9, executor 0, partition 5, PROCESS_LOCAL, 7882 bytes)
Custom partitioner:
Partitions Structure:
[[(0,null)], [(1,null)], [(2,null)], [(3,null)], [(4,null)], [(5,null)]]
DAGScheduler:54 - Submitting 6 missing tasks from ResultStage 3
(MapPartitionsRDD[5] at map at xxx.java:77)
TaskSchedulerImpl:54 - Adding task set 3.0 with 6 tasks
Starting task 0.0 in stage 3.0 (TID 12, xxx.xx.xx.27, executor 5, partition 0, NODE_LOCAL, 7666 bytes)
Starting task 4.0 in stage 3.0 (TID 13, xxx.xx.xx.26, executor 2, partition 4, NODE_LOCAL, 7666 bytes)
Starting task 1.0 in stage 3.0 (TID 14, xxx.xx.xx.27, executor 5, partition 1, NODE_LOCAL, 7666 bytes)
Starting task 5.0 in stage 3.0 (TID 15, xxx.xx.xx.26, executor 2, partition 5, NODE_LOCAL, 7666 bytes)
Starting task 2.0 in stage 3.0 (TID 16, xxx.xx.xx.27, executor 5, partition 2, NODE_LOCAL, 7666 bytes)
Starting task 3.0 in stage 3.0 (TID 17, xxx.xx.xx.27, executor 5, partition 3, NODE_LOCAL, 7666 bytes)
Here are some key observations:

- The custom-partitioner stage references MapPartitionsRDD[5], with no reference to MapPartitionsRDD[4].
- With the custom partitioner, all tasks start with NODE_LOCAL locality instead of PROCESS_LOCAL, and only executors 5 and 2 receive tasks.
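The identical partition structure is, at least for this data, expected: for Integer keys, key.hashCode() is the key's own value, so Spark's default HashPartitioner (a non-negative modulus of the key's hashCode) maps keys 0 through 5 to exactly the same buckets as an identity partitioner does. A standalone sketch of that coincidence (plain Java, no Spark required; class and method names are mine, for illustration only):

```java
public class PartitionMapping {
    // Spark's HashPartitioner takes a non-negative modulus of the key's hashCode
    static int hashPartition(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // The custom TestPartitioner in the question simply returns the key itself
    static int identityPartition(int key) {
        return key;
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        for (int key = 0; key < numPartitions; key++) {
            // For Integer keys 0..5 the two mappings coincide bucket for bucket
            System.out.println(key + " -> hash=" + hashPartition(key, numPartitions)
                    + ", identity=" + identityPartition(key));
        }
    }
}
```

Since the key-to-partition mapping is unchanged, glom() prints the same structure either way. What partitionBy does change is that it inserts a shuffle, so the next stage's tasks take their preferred locations from the shuffle outputs rather than from the parallelized data, which would be consistent with the NODE_LOCAL entries in the log above.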
Here is the code. The goal is to have foo executed once on each node of the cluster. That works with the default partitioning, but not with the custom partitioner.
Main.java
public static void testPartitions() {
    Integer numPartitions = 6;
    List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5);
    JavaRDD<Integer> dataRDD = SparkExecutor.sc.parallelize(data, numPartitions);
    JavaPairRDD<Integer, Object> dataPairRDD = dataRDD.mapToPair(currData -> new Tuple2<Integer, Object>(currData, null));
    TestPartitioner partitioner = new TestPartitioner(numPartitions);
    // dataPairRDD = dataPairRDD.partitionBy(partitioner); // toggle custom partitioner
    logger.info("Num Partitions: {}", dataPairRDD.getNumPartitions());
    logger.info("Partitions Structure: {}", dataPairRDD.glom().collect());
    JavaRDD<Integer> mapRDD = dataPairRDD.map(currData -> foo(currData));
    mapRDD.collect();
}
public static Integer foo(Tuple2<Integer, Object> data) throws Exception {
    Integer num = data._1();
    TimeUnit.SECONDS.sleep(5); // Simulate some work being done
    return num;
}
TestPartitioner.java
public class TestPartitioner extends Partitioner {
    private int numPartitions;

    // Constructor
    public TestPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        int bucket = (int) key;
        return bucket;
    }

    @Override
    public int numPartitions() {
        return this.numPartitions;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TestPartitioner) {
            TestPartitioner partitionerObject = (TestPartitioner) obj;
            if (partitionerObject.numPartitions == this.numPartitions) {
                return true;
            }
        }
        return false;
    }
}
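One detail worth flagging in TestPartitioner, independent of the scheduling question: it overrides equals but not hashCode, which breaks the equals/hashCode contract whenever such partitioners end up in hash-based collections. A minimal sketch of the fix (plain Java without the Spark Partitioner base class, so it runs standalone; the class name and the Objects.hash choice are mine, for illustration):

```java
import java.util.Objects;

// Sketch: same equality semantics as TestPartitioner, plus a consistent hashCode
class TestPartitionerFixed {
    private final int numPartitions;

    TestPartitionerFixed(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof TestPartitionerFixed
                && ((TestPartitionerFixed) obj).numPartitions == this.numPartitions;
    }

    @Override
    public int hashCode() {
        // Equal partitioners must report equal hash codes
        return Objects.hash(numPartitions);
    }
}
```

With only equals overridden, two partitioners that compare equal would still report different (identity-based) hash codes, so a HashSet or HashMap could treat them as distinct objects.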