Question

Is it possible to set the number of partitions when declaring an RDD?

After going through the documentation, I can't find any way to do this. I did see parallelize(), but it takes a list, which doesn't seem to fit my case.

Here is how I set everything up:

    SparkConf sparkConf = new SparkConf().setAppName("MyApp")
            .set("master", "yarn-cluster")
            .set("spark.submit.deployMode" ,"cluster")
            .set("spark.executor.instances","8")
            .set("spark.executor.cores","4")
            .set("spark.executor.memory","5120M")
            .set("spark.driver.memory","5120M")
            .set("spark.yarn.memoryOverhead","10000M")
            .set("spark.yarn.driver.memoryOverhead","10000M")
            .set("spark.dynamicAllocation.enabled", "true");

    Configuration conf = HBaseConfiguration.create();
    JavaPairRDD<ImmutableBytesWritable, Result> hbaseRdd = sparkContext.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
    hbaseRdd.saveAsHadoopFile(fileSystemPath, TextInputFormat.class, LongWritable.class, TextOutputFormat.class, GzipCodec.class);

I would like this to run across multiple partitions.

I saw that you can do sparkContext.parallelize(...).newAPIHadoopRDD ....., but that doesn't seem to apply to my case.

Manually setting the number of partitions - JavaPairRDD

0 answers: