Question

Is there a way (a method) in Spark to find out the partition ID/number?

Take this example:

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) => println("Inside partition " + ???)
                               x + y }

I want to put some code in place of ??? to print the partition ID/number.

Answer 1

You can also use

TaskContext.getPartitionId()

for example, in place of the currently missing foreachPartitionWithIndex():

https://github.com/apache/spark/pull/5927#issuecomment-99697229
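To make this concrete, here is a minimal sketch that tags each element with the ID returned by TaskContext.getPartitionId(). It uses mapPartitions rather than reduce so that the call clearly runs inside a task. The app name and the local[3] master are assumptions for running outside spark-shell; inside the shell, getOrCreate should simply return the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Assumption: local mode with 3 threads; an already running context is reused.
val conf = new SparkConf().setMaster("local[3]").setAppName("partition-id-sketch")
val sc = SparkContext.getOrCreate(conf)

val input1 = sc.parallelize(List(8, 9, 10), 3)

// Tag each element with the ID of the partition whose task processes it.
val tagged = input1.mapPartitions { itr =>
  val pid = TaskContext.getPartitionId() // evaluated inside the running task
  itr.map(x => (pid, x))
}.collect()

tagged.foreach { case (pid, x) => println(s"Inside partition $pid: $x") }
```

One caveat, as far as I can tell: getPartitionId() returns 0 when there is no active TaskContext. With reduce, the final merge of the per-partition results happens on the driver, so a println placed inside the reduce function can report 0 for calls that did not run in partition 0.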

Answer 2

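Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.

I created an RDD (input) with 3 partitions. The elements of input are tagged with the partition index (index) in the call to mapPartitionsWithIndex: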

scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)

Answer 3

Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine the result with aggregate.)
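A rough sketch of that combination, assuming the goal is a sum: mapPartitionsWithIndex emits one (index, partial sum) pair per partition, and aggregate then folds those pairs into a total. The names partialSums and total, the app name, and the local[3] master are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 3 threads; an already running context is reused.
val conf = new SparkConf().setMaster("local[3]").setAppName("partition-sum-sketch")
val sc = SparkContext.getOrCreate(conf)

val input = sc.parallelize(11 to 17, 3)

// One (partitionIndex, partialSum) pair per partition.
val partialSums = input.mapPartitionsWithIndex { (index, itr) =>
  Iterator((index, itr.sum))
}

partialSums.collect().foreach { case (index, sum) =>
  println(s"Inside partition $index: partial sum $sum")
}

// aggregate folds the per-partition pairs into a single total.
val total = partialSums.aggregate(0)((acc, pair) => acc + pair._2, _ + _)
println(s"Total: $total")
```

The partition index is available where the partial sums are computed, which is the part reduce alone does not expose.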

Answer 4

I came across this old question while looking for the spark_partition_id SQL function for DataFrame.
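For reference, a minimal sketch of how that function might be used, assuming a local SparkSession and a single column named value (both are illustrative choices, not part of the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Assumption: local mode with 3 threads; an already running session is reused.
val spark = SparkSession.builder()
  .master("local[3]")
  .appName("spark-partition-id-sketch")
  .getOrCreate()
import spark.implicits._

// Build a DataFrame on top of a 3-partition RDD.
val df = spark.sparkContext.parallelize(11 to 17, 3).toDF("value")

// Add a column holding the ID of the partition each row lives in.
val withPid = df.withColumn("partition_id", spark_partition_id())
withPid.show()
```

The same function should also be callable from SQL text as spark_partition_id().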
