Get the minimum value from an Array in a Spark DataFrame column

Time: 2018-03-20 18:41:29

Tags: scala apache-spark

I have a DataFrame with array columns.

val DF = Seq(
  ("123", "|1|2","3|3|4" ),
  ("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))

|id           |complete1|complete2|
+-------------+---------+---------+
|          123| [, 1, 2]|[3, 3, 4]|
|          124| [, 3, 2]| [, 3, 4]|
+-------------+---------+---------+

How can I extract the minimum value of each array?

|id           |complete1|complete2|
+-------------+---------+---------+
|          123| 1       | 3       |
|          124| 2       | 3       |
+-------------+---------+---------+

I tried defining a UDF to do this, but I get an error.

def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)   
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))

val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;

4 answers:

Answer 0 (score: 2)

You can define your udf function as follows. Note that the parameter is a Seq[String]: Spark passes an array column to a Scala UDF as a WrappedArray (a Seq), not an Array, which is what causes the ClassCastException.

def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)

and call it as

DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)

which should give you

+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1        |3        |
|124|2        |3        |
+---+---------+---------+

Update

If the array passed to the udf function is empty, or contains only empty strings, you will encounter

java.lang.UnsupportedOperationException: empty.min

You should handle that case with an if-else condition inside the udf function.
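The empty-array case can be sketched in plain Scala, without Spark. The helper name safeMin and the choice to return an Option are assumptions made for this illustration and are not part of the original answer; the if-else mirrors the condition the answer recommends. Wrapping it as udf(safeMin _) should then give a null for rows with no usable values, though that step is not exercised here.

```scala
import scala.util.Try

// Hypothetical helper (not from the original answer): returns None instead of
// throwing when no element of the array parses as an integer.
def safeMin(arr: Seq[String]): Option[Int] = {
  // Drop empty strings and anything else that is not a valid integer.
  val numbers = arr.flatMap(s => Try(s.trim.toInt).toOption)
  if (numbers.isEmpty) None else Some(numbers.min)
}

println(safeMin(Seq("", "1", "2"))) // Some(1)
println(safeMin(Seq("", "")))       // None
println(safeMin(Seq.empty))         // None
```

Returning an Option rather than a sentinel such as 0 keeps "no value" distinguishable from a real minimum of 0.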

I hope the answer is helpful.

Answer 1 (score: 1)

Here is a way to do this without using a udf.

First explode the arrays obtained from split(), then group by id and take the min.

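  // Assumes spark-shell, where spark.implicits._ (for $ and toDF) is already in scope
  import org.apache.spark.sql.functions.{explode, min, split}
  import org.apache.spark.sql.types.IntegerType

  val DF = Seq(
    ("123", "|1|2","3|3|4" ),
    ("124", "|3|2","|3|4" )
  ).toDF("id", "complete1", "complete2")
    .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
    // One row per array element (exploding both columns yields every combination)
    .withColumn("complete1", explode($"complete1"))
    .withColumn("complete2", explode($"complete2"))
    // With non-ANSI casts, "" becomes null, which min ignores
    .groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))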

Output:

+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2        |3        |
|123|1        |3        |
+---+---------+---------+

Answer 2 (score: 1)

You don't need a UDF; you can use sort_array:

val DF = Seq(
  ("123", "|1|2","3|3|4" ),
  ("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
  .select(
    $"id",
    split(regexp_replace($"complete1","^\\|",""), "\\|").as("complete1"),
    split(regexp_replace($"complete2","^\\|",""), "\\|").as("complete2")
  )


// now select the minimum: the first element of each sorted array
DF.select(
  $"id",
  sort_array($"complete1")(0).as("complete1"),
  sort_array($"complete2")(0).as("complete2")
).show()

+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+

Note that I removed the leading | before splitting, to avoid empty strings in the array.
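One caveat with this approach: the arrays still hold strings, so sort_array orders them as strings rather than as numbers. That is fine for the single-digit values in the question, but it may give a surprising result for multi-digit values. A minimal plain-Scala sketch of the difference follows; the sample values are made up for illustration and are not from the question's data.

```scala
// Values chosen for illustration only; they are not from the question's data.
val values = Seq("10", "2", "3")

// Sorting strings compares character by character, so "10" comes before "2".
val lexicographicMin = values.sorted.head

// Converting to Int first gives the numeric minimum.
val numericMin = values.map(_.toInt).min

println(lexicographicMin) // 10
println(numericMin)       // 2
```

If the data can contain multi-digit numbers, casting the array to integers before sorting avoids the problem.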

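Answer 3 (score: 0)

Starting with Spark 2.4, you can use array_min to find the minimum value in an array. To use this function, you first have to cast the array of strings to an array of integers. The cast also takes care of the empty strings by converting them to null values.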

DF.select($"id",
          array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
          array_min(expr("cast(complete2 as array<int>)")).as("complete2"))