How do I get the element-wise maximum across multiple arrays in a DataFrame column?

Date: 2018-07-22 13:18:52

Tags: scala dataframe

The DataFrame looks like this:

import spark.implicits._
val df1 = List(
    ("id1", Array(0, 2)),
    ("id1", Array(2, 1)),
    ("id2", Array(0, 3))
  ).toDF("id", "value")

+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+

I want to group by id and take the element-wise maximum over each group's value arrays. For id1 the element-wise maximum is Array(2, 2). The result I want is:

import spark.implicits._
val res = List(
    ("id1", Array(2, 2)),
    ("id2", Array(0, 3))
  ).toDF("id", "value")

+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+
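The core operation being asked for, taking the element-wise maximum of two equal-length arrays, can be sketched in plain Scala with `zip` (the variable names here are illustrative):

```scala
object ElementMax {
  def main(args: Array[String]): Unit = {
    // Two rows for the same id, as in the question.
    val a = Array(0, 2)
    val b = Array(2, 1)
    // Pair up corresponding elements and keep the larger of each pair.
    val elementMax = a.zip(b).map { case (x, y) => math.max(x, y) }
    // elementMax: Array(2, 2)
    println(elementMax.mkString(", "))
  }
}
```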

2 answers:

Answer 0 (score: 2)

import spark.implicits._

// Sample input with three-element arrays.
val df1 = List(
  ("id1", Array(0, 2, 3)),
  ("id1", Array(2, 1, 4)),
  ("id2", Array(0, 7, 3))
).toDF("id", "value")

// Drop to the RDD API and merge each id's arrays with an
// element-wise maximum via reduceByKey.
val df2rdd = df1.rdd
  .map(x => (x(0).toString, x.getSeq[Int](1)))
  .reduceByKey((x, y) => {
    // Build the element-wise maximum of the two arrays
    // (assumes every array for an id has the same length).
    val arrlength = x.length
    var i = 0
    val resarr = scala.collection.mutable.ArrayBuffer[Int]()
    while (i < arrlength) {
      if (x(i) >= y(i)) {
        resarr.append(x(i))
      } else {
        resarr.append(y(i))
      }
      i += 1
    }
    resarr
  })
  .toDF("id", "newvalue")

Answer 1 (score: 1)

You can follow the steps below:

//Input df
+---+---------+
| id|    value|
+---+---------+
|id1|[0, 2, 3]|
|id1|[2, 1, 4]|
|id2|[0, 7, 3]|
+---+---------+

//Solution approach: 
import org.apache.spark.sql.functions.{collect_set, udf}

// Collect each id's arrays into a single column, then reduce
// them with a UDF that takes the element-wise maximum.
val df1 = df.groupBy("id").agg(collect_set("value").as("value"))
val maxUDF = udf { (s: Seq[Seq[Int]]) =>
  s.reduceLeft((prev, next) =>
    prev.zip(next).map(tup => if (tup._1 > tup._2) tup._1 else tup._2))
}
df1.withColumn("value", maxUDF(df1.col("value"))).show

//Sample Output:
+---+---------+
| id|    value|
+---+---------+
|id1|[2, 2, 4]|
|id2|[0, 7, 3]|
+---+---------+
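The body of `maxUDF` is plain Scala and can be exercised on its own: `reduceLeft` folds the element-wise maximum across however many arrays one id collected. A small local check (the sample group here is illustrative):

```scala
object UdfBodyCheck {
  def main(args: Array[String]): Unit = {
    // The arrays collected for one id, as the UDF would receive them.
    val group: Seq[Seq[Int]] = Seq(Seq(0, 2, 3), Seq(2, 1, 4))
    // Same logic as the UDF: fold left, keeping the larger element
    // at each position.
    val merged = group.reduceLeft { (prev, next) =>
      prev.zip(next).map(tup => if (tup._1 > tup._2) tup._1 else tup._2)
    }
    // merged: Seq(2, 2, 4)
    println(merged)
  }
}
```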

I hope this helps.