How to add a collection column based on the max and min values in a DataFrame

Time: 2018-10-18 07:49:34

Tags: scala apache-spark dataframe

I have this DataFrame:

val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")

I want to convert 5k-7k into 5,6,7 and 4k-8k into 4,5,6,7,8.

Original DataFrame: [image: original dataframe]

Desired DataFrame: [image: desired dataframe]

What I have tried so far:

a.select("min","max","salary")
      .as[(Integer,Integer,String)]
      .map{
        case(min,max,salary) =>
          (min,max,salary.split("-").flatMap(x => {
            for(i <- 0 to x.length-1) yield (i)
          }))
      }.toDF("1","2","3").show()

3 Answers:

Answer 0: (score: 0)

You need to create a UDF to expand the range. The following UDF converts 5k-7k into 5,6,7, 4k-8k into 4,5,6,7,8, and so on.

import org.apache.spark.sql.functions._
val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")

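val extendUDF = udf((str: String) => {
  // Strip the "k" suffixes and split on "-" to get the two bounds as integers.
  val nums = str.replace("k","").split("-").map(_.toInt)
  // Build the inclusive range and join it into a comma-separated string.
  (nums(0) to nums(1)).toList.mkString(",")
})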

val output = inputDF.withColumn("salary_level", extendUDF($"salary"))

Output:

scala> output.show
+---+---+------+----------------+
|min|max|salary|    salary_level|
+---+---+------+----------------+
|  5|  7| 5k-7k|           5,6,7|
|  4|  8| 4k-8k|       4,5,6,7,8|
|  6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+
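
The string handling inside `extendUDF` can be checked without a Spark session. The sketch below pulls that logic into a plain function. The names `SalaryRange` and `expand` are hypothetical, and returning `None` for malformed input is an assumption, since the answer does not say how bad rows should be treated.

```scala
import scala.util.Try

object SalaryRange {
  // Expands a label such as "5k-7k" into "5,6,7".
  // Returns None when the label does not hold exactly two integer bounds
  // (an assumed policy; the UDF in the answer would throw instead).
  def expand(label: String): Option[String] = {
    val parts = label.replace("k", "").split("-").map(_.trim)
    parts match {
      case Array(lo, hi) =>
        // toInt may fail on non-numeric text, so the conversion is wrapped in Try.
        Try((lo.toInt to hi.toInt).mkString(",")).toOption
      case _ => None
    }
  }
}
```

`SalaryRange.expand("6k-12k")` gives `Some("6,7,8,9,10,11,12")`. A reversed label such as `6k-2k` gives `Some("")`, because `6 to 2` is an empty range.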

Answer 1: (score: 0)

You can do this easily with a UDF.

// Define a Spark UDF that builds the list of integers from min to max (inclusive).
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max+1) )

val input = sc.parallelize(List((5,7,"5k-7k"),
                          (4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show
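
This answer shows no output. `List.range(min, max + 1)` excludes its upper bound, which is why the UDF adds 1, and the resulting column holds an array of integers rather than a comma-separated string. A minimal plain-Scala sketch of that logic follows; the names `RangeDemo`, `makeRange`, and `asText` are hypothetical, and `asText` is an extra step that is not part of the answer.

```scala
object RangeDemo {
  // Same expression as the body of makeRangeLists: List.range is end-exclusive,
  // so max + 1 is needed to include max itself.
  def makeRange(min: Int, max: Int): List[Int] = List.range(min, max + 1)

  // Optional step (not in the answer): join the values into the string form
  // the question asks for.
  def asText(min: Int, max: Int): String = makeRange(min, max).mkString(",")
}
```

For the first row, `RangeDemo.makeRange(5, 7)` is `List(5, 6, 7)` and `RangeDemo.asText(5, 7)` is `"5,6,7"`.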

Answer 2: (score: 0)

Here is a quick option using a UDF:

import org.apache.spark.sql.functions

val toSalary = functions.udf((value: String) => {
  // Drop every 'k', split on "-", and sort so the smaller bound comes first.
  val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
  // drop(1) instead of tail: tail throws on an empty array.
  val (startSalary, endSalary) = (array.headOption, array.drop(1).headOption)

  (startSalary, endSalary) match {
    case (Some(s), Some(e)) => (s to e).toList.mkString(",")
    case _ => ""
  }
})

for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")

Input

+---+---+------+
|min|max|salary|
+---+---+------+
|  5|  7| 5k-7k|
|  4|  8| 4k-8k|
|  6| 12| 6k-2k|
+---+---+------+

Result

+---+---+------------+
|min|max|salary_level|
+---+---+------------+
|  5|  7|       5,6,7|
|  4|  8|   4,5,6,7,8|
|  6| 12|   2,3,4,5,6|
+---+---+------------+

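First, you remove the k characters and split the string on the dash. Then you take the start and end salary and build a range between them, which is joined into a comma-separated string. Because the two values are sorted first, 6k-2k yields 2,3,4,5,6.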
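
Those steps can be traced in plain Scala on the third input row. The sketch below follows the body of `toSalary` one step at a time; the object name `ToSalaryTrace` and its method names are hypothetical.

```scala
object ToSalaryTrace {
  // Step 1: remove every 'k' from the label.
  def stripK(value: String): String = value.filterNot(_ == 'k')

  // Step 2: split on "-", parse the integers, and sort them ascending.
  def bounds(value: String): Array[Int] =
    stripK(value).split("-").map(_.trim.toInt).sorted

  // Step 3: build the inclusive range from the first two values, or return
  // an empty string when fewer than two values are present.
  def expand(value: String): String = bounds(value) match {
    case Array(start, end, _*) => (start to end).mkString(",")
    case _                     => ""
  }
}
```

For `"6k-2k"`, `stripK` returns `"6-2"`, `bounds` returns `Array(2, 6)`, and `expand` returns `"2,3,4,5,6"`. The sort may be masking a typo in the input, since the `max` column for that row is 12.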