Apache Spark assigns different partitions to the same executor with a custom partitioner, even though executors are idle

Date: 2019-05-19 01:20:28

Tags: java apache-spark apache-spark-2.0

I have a problem where, when using a custom partitioner, Spark assigns two different partitions to the same executor even though other executors are available. The partition structure appears identical with and without the custom partitioner, yet for some reason the partition-to-executor assignment differs. Here is a simplified version of the problem at hand:

Default partitioner:

Partitions Structure:
[[(0,null)], [(1,null)], [(2,null)], [(3,null)], [(4,null)], [(5,null)]]

DAGScheduler:54 - Submitting 6 missing tasks from ResultStage 1 
(MapPartitionsRDD[4] at map at xxx.java:77) 
TaskSchedulerImpl:54 - Adding task set 1.0 with 6 tasks

Starting task 0.0 in stage 1.0 (TID 6, xxx.xx.xx.4,  executor 3, partition 0, PROCESS_LOCAL, 7882 bytes)
Starting task 1.0 in stage 1.0 (TID 7, xxx.xx.xx.14, executor 4, partition 1, PROCESS_LOCAL, 7877 bytes)
Starting task 2.0 in stage 1.0 (TID 8, xxx.xx.xx.27, executor 5, partition 2, PROCESS_LOCAL, 7882 bytes)
Starting task 3.0 in stage 1.0 (TID 9, xxx.xx.xx.3,  executor 1, partition 3, PROCESS_LOCAL, 7882 bytes)
Starting task 4.0 in stage 1.0 (TID 10, xxx.xx.xx.26,executor 2, partition 4, PROCESS_LOCAL, 7882 bytes)
Starting task 5.0 in stage 1.0 (TID 11, xxx.xx.xx.9, executor 0, partition 5, PROCESS_LOCAL, 7882 bytes)

Custom partitioner:

Partitions Structure: 
[[(0,null)], [(1,null)], [(2,null)], [(3,null)], [(4,null)], [(5,null)]]

DAGScheduler:54 - Submitting 6 missing tasks from ResultStage 3 
(MapPartitionsRDD[5] at map at xxx.java:77) 
TaskSchedulerImpl:54 - Adding task set 3.0 with 6 tasks

Starting task 0.0 in stage 3.0 (TID 12, xxx.xx.xx.27, executor 5, partition 0, NODE_LOCAL, 7666 bytes)
Starting task 4.0 in stage 3.0 (TID 13, xxx.xx.xx.26, executor 2, partition 4, NODE_LOCAL, 7666 bytes)
Starting task 1.0 in stage 3.0 (TID 14, xxx.xx.xx.27, executor 5, partition 1, NODE_LOCAL, 7666 bytes)
Starting task 5.0 in stage 3.0 (TID 15, xxx.xx.xx.26, executor 2, partition 5, NODE_LOCAL, 7666 bytes)
Starting task 2.0 in stage 3.0 (TID 16, xxx.xx.xx.27, executor 5, partition 2, NODE_LOCAL, 7666 bytes)
Starting task 3.0 in stage 3.0 (TID 17, xxx.xx.xx.27, executor 5, partition 3, NODE_LOCAL, 7666 bytes)

Here are some key observations:

  • The partition structures are exactly the same
  • The default partitioner produces a one-to-one mapping of partitions to executors, while the custom partitioner produces a many-to-one mapping
  • The default partitioner's tasks run PROCESS_LOCAL, while the custom partitioner's run NODE_LOCAL
  • The custom-partitioner run always references MapPartitionsRDD[5], never MapPartitionsRDD[4]
  • The default partitioner assigns partitions in sequential order [0-5], while the custom partitioner's assignment order is shuffled

Here is the code. The goal is to have foo execute once on each node of the cluster. This works with the default partitioner, but not with the custom partitioner.

Main.java

    public static void testPartitions(){
        Integer numPartitions = 6;
        List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5);
        JavaRDD<Integer> dataRDD = SparkExecutor.sc.parallelize(data,numPartitions);
        JavaPairRDD<Integer, Object> dataPairRDD = dataRDD.mapToPair( currData -> new Tuple2<Integer, Object>(currData, null));
        TestPartitioner partitioner = new TestPartitioner(numPartitions);        
        // dataPairRDD = dataPairRDD.partitionBy(partitioner); //toggle custom partitioner

        logger.info("Num Partitions: {}", dataPairRDD.getNumPartitions());
        logger.info("Partitions Structure: {}", dataPairRDD.glom().collect());

        JavaRDD<Integer> mapRDD = dataPairRDD.map( currData -> foo(currData));        

        mapRDD.collect();
    }

    public static Integer foo(Tuple2<Integer, Object> data) throws Exception {
        Integer num = data._1();
        TimeUnit.SECONDS.sleep(5); //Simulate some work being done
        return num;
    }

TestPartitioner.java

public class TestPartitioner extends Partitioner{
    private int numPartitions;

    //Constructor
    public TestPartitioner(int numPartitions){
        this.numPartitions = numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        int bucket = (int) key;
        return bucket;
    }

    @Override
    public int numPartitions() {
        return this.numPartitions;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TestPartitioner) {
             TestPartitioner partitionerObject = (TestPartitioner) obj;
            if (partitionerObject.numPartitions == this.numPartitions)
                return true;
        }
        return false;
    }

    @Override
    public int hashCode() {
        // equals() is overridden above, so hashCode() must be overridden
        // consistently as well; Spark compares partitioners to decide
        // whether repartitioning is needed
        return this.numPartitions;
    }
}
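For reference, the routing logic of the partitioner above can be exercised without a cluster. The following is a minimal standalone sketch (no Spark dependency, class and method names are my own) that mirrors the identity-style mapping, with an added modulo guard that the original does not have:

```java
// Standalone sketch of the identity-style routing used by TestPartitioner:
// key k is sent to partition k. The modulo guard against out-of-range keys
// is an addition here, not part of the original code.
import java.util.Arrays;
import java.util.List;

public class IdentityPartitionDemo {
    static int getPartition(Object key, int numPartitions) {
        int bucket = (int) key;        // same cast as in TestPartitioner
        return bucket % numPartitions; // guard added for this sketch
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        List<Integer> keys = Arrays.asList(0, 1, 2, 3, 4, 5);
        for (int key : keys) {
            System.out.println("key " + key + " -> partition "
                    + getPartition(key, numPartitions));
        }
    }
}
```

With six keys and six partitions this yields exactly one key per partition, matching the glom() output shown in both logs.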

0 Answers:

There are no answers yet