Question

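I am using Spark with Java and a Cassandra database. In my program I use mapPartitions to query Cassandra, but I noticed that my mapPartitions call runs on only one Spark node. To check the number of partitions in my RDD, I used: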

System.out.println(MyRDD.partitions().size());

It prints 1, meaning a single partition. I found that I can change the number of partitions with:

JavaRDD MyRDD2= MyRDD.coalesce(8, false);

But it does not work: the number of partitions is still 1.

Could you please help me change the number of partitions?

Answer 1

You have to set shuffle to true in order to coalesce into a larger number of partitions:

JavaRDD MyRDD2= MyRDD.coalesce(8, true);

Answer 2

According to the description of the RDD coalesce() function, it is meant for reducing the number of partitions. To increase the number of partitions, the repartition() function should be used.

val textRDD = scontext.textFile("file:///home/rajeev/Test.scala", 3)
print("================== " + textRDD.getNumPartitions)

val newRDD = textRDD.coalesce(6, false)
print("==================:: " + newRDD.getNumPartitions + "\n")

val newRDD1 = textRDD.coalesce(6, true)
print("==================:: " + newRDD1.getNumPartitions + "\n")

The three print statements output 3, 3, and 6, respectively.

Ideally, this should not happen. Could you please explain why it does? Is it because the data is being shuffled?
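The output above (3, 3, 6) is consistent with the following rule: without a shuffle, coalesce can only merge existing partitions, so the count cannot grow; with a shuffle, the records are redistributed across the requested number of partitions. In Spark's RDD API, repartition(n) is defined as coalesce(n, shuffle = true). Below is a minimal sketch of that rule in plain Java. It is a toy model with no Spark dependency, not Spark's implementation, and CoalesceModel and its coalesce method are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Spark's code, just the partition-count rule.
class CoalesceModel {

    // Returns a new list of partitions.
    // shuffle == false: existing partitions are merged whole, never split,
    //                   so the result has min(target, current) partitions.
    // shuffle == true:  every record is redistributed, so the result has
    //                   exactly `target` partitions.
    static <T> List<List<T>> coalesce(List<List<T>> parts, int target, boolean shuffle) {
        if (target <= 0) {
            throw new IllegalArgumentException("target must be positive");
        }
        int n = shuffle ? target : Math.min(target, parts.size());
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        int next = 0; // round-robin counter used only when shuffling
        for (int p = 0; p < parts.size(); p++) {
            for (T record : parts.get(p)) {
                // Simplification: real Spark groups partitions by locality
                // when not shuffling; here partition p just goes to p % n.
                int bucket = shuffle ? next++ % n : p % n;
                out.get(bucket).add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> three = Arrays.asList(
                Arrays.asList(1, 2, 3, 4),
                Arrays.asList(5, 6, 7, 8),
                Arrays.asList(9, 10, 11, 12));
        System.out.println(three.size());                      // 3
        System.out.println(coalesce(three, 6, false).size());  // 3
        System.out.println(coalesce(three, 6, true).size());   // 6
    }
}
```

Running main prints 3, 3 and 6, one per line, mirroring the Scala session above. With a single input partition, as in the question, the same model gives 1 for coalesce(8, false) and 8 for coalesce(8, true).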

How do I change the number of partitions using coalesce?
