I have an employee dataset. I need to partition the employees by salary according to certain conditions. I created a DataFrame, converted it to a Dataset of a custom object, and wrote a custom partitioner for salary:
class SalaryPartition(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    import com.csc.emp.spark.tutorial.PartitonObj._
    key.asInstanceOf[Emp].EMPLOYEE_ID match {
      case salary if salary < 10000 => 1
      case salary if salary >= 10001 && salary < 20000 => 2
      case _ => 3
    }
  }
}
Question: how do I invoke/call my custom partitioner? I can't find partitionBy on the Dataset/DataFrame. Is there any other way?
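Aside: Spark's Partitioner contract requires getPartition to return a 0-based index in [0, numPartitions), while the match above returns 1 to 3. Below is a minimal pure-Scala sketch of a 0-based version; a local Partitioner stand-in is used so it compiles without Spark, and it also puts a salary of exactly 10000 into the middle bucket, which the original match skipped.

```scala
// Stand-in for org.apache.spark.Partitioner so this sketch compiles without Spark.
// Only the contract matters here: numPartitions and a 0-based getPartition.
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

case class Emp(id: Int, salary: Int)

// 0-based buckets: getPartition must return a value in [0, numPartitions).
class SalaryPartition(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Emp].salary match {
    case s if s < 10000 => 0
    case s if s < 20000 => 1 // also covers the salary == 10000 gap
    case _              => 2
  }
}

// A Partitioner applies only to pair RDDs, so with Spark this would be
// wired up roughly as (hypothetical, requires a live SparkSession):
//   empDS.rdd.keyBy(identity).partitionBy(new SalaryPartition(3))
```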
Answer 0 (score: 1)
To put my comment in code form:
// Assumed setup (not in the original snippet): a SparkSession named `spark`,
// `import spark.implicits._`, `import org.apache.spark.sql.functions.udf`,
// and a simple case class Emp(id: Int, salary: Int).

val empDS = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDS()

println(s"Original partitions number: ${empDS.rdd.partitions.size}")
println("-- Original partition: data --")
empDS.rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()

// Map a salary to a grade; note that a salary of exactly 10000 falls
// through to grade 3 here, since the first two cases skip it.
val getSalaryGrade = (salary: Int) => salary match {
  case salary if salary < 10000 => 1
  case salary if salary >= 10001 && salary < 20000 => 2
  case _ => 3
}
val getSalaryGradeUDF = udf(getSalaryGrade)

// Add the grade column and repartition by it (hash partitioning on the column).
val salaryGraded = empDS.withColumn("salaryGrade", getSalaryGradeUDF($"salary"))
val repartitioned = salaryGraded.repartition($"salaryGrade")

println()
println(s"Partitions number after: ${repartitioned.rdd.partitions.size}")
println("-- Repartitioned partition: data --")
repartitioned.as[Emp].rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()
The output is:
Original partitions number: 2
-- Original partition: data --
Partition 1: Emp(3,30000)
Partition 0: Emp(5,1000)
Partition 1: Emp(2,2000)
Partition 0: Emp(4,15000)
Partitions number after: 5
-- Repartitioned partition: data --
Partition 1: Emp(3,30000)
Partition 3: Emp(5,1000)
Partition 3: Emp(2,2000)
Partition 4: Emp(4,15000)
Note: repartitioning by a column uses hash partitioning, so several "salaryGrade" values can land in the same partition while other partitions stay empty; the partition count after repartition comes from spark.sql.shuffle.partitions (apparently set to 5 here).
Suggestion: "groupBy" or something similar looks more reliable.
To keep working with Dataset entities, "groupByKey" can be used:
empDS.groupByKey(x => getSalaryGrade(x.salary)).mapGroups((index, it) => {
  it.foreach(r => println(s"Group $index: $r")); index
}).count()
Output:
Group 1: Emp(5,1000)
Group 3: Emp(3,30000)
Group 1: Emp(2,2000)
Group 2: Emp(4,15000)
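For intuition, the grouping that groupByKey performs can be previewed with plain Scala collections, no Spark needed. This is only a sketch: it reuses the getSalaryGrade logic from above and assumes a local Emp case class with the same two fields.

```scala
case class Emp(id: Int, salary: Int)

// Same grade function as in the answer, as a plain method.
def getSalaryGrade(salary: Int): Int = salary match {
  case s if s < 10000 => 1
  case s if s >= 10001 && s < 20000 => 2
  case _ => 3
}

val emps = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000))

// Same key function as groupByKey: grade -> employees with that grade.
val byGrade: Map[Int, List[Emp]] = emps.groupBy(e => getSalaryGrade(e.salary))

byGrade.toSeq.sortBy(_._1).foreach { case (grade, es) =>
  es.foreach(e => println(s"Group $grade: $e"))
}
```

Unlike repartition, this (and groupByKey) guarantees one group per key, which is why the answer calls it the more reliable route.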