Spark DataFrame: removing the MAX value in a group

Date: 2016-05-11 00:49:47

Tags: apache-spark dataframe apache-spark-sql

My data looks like this:

id | val
---------------- 
a1 |  10
a1 |  20
a2 |  5
a2 |  7
a2 |  2

Grouping by "id", I am trying to remove the row with MAX(val) from each group.

The result should be:

id | val
---------------- 
a1 |  10
a2 |  5
a2 |  2

I am using a Spark DataFrame with SQLContext. I need something like:

DataFrame df = sqlContext.sql("SELECT * FROM jsontable WHERE (id, val) NOT IN (SELECT id, MAX(val) FROM jsontable GROUP BY id)");

How can I do that?
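
In case the tuple-style NOT IN subquery is not supported by your Spark SQL version, the same intent can be expressed as a join against each group's max; a sketch (in Scala, assuming the data is registered as the table jsontable):

// Join each row to its group's max and keep only the rows strictly below it
val df = sqlContext.sql(
  """SELECT t.id, t.val
    |FROM jsontable t
    |JOIN (SELECT id, MAX(val) AS maxVal FROM jsontable GROUP BY id) m
    |  ON t.id = m.id
    |WHERE t.val < m.maxVal""".stripMargin)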

3 Answers:

Answer 0 (score: 3):

You can do this using DataFrame operations and window functions. Assuming your data is in the DataFrame df1:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Max of "val" over a window partitioned by "id"
val maxOnWindow = max(col("val")).over(Window.partitionBy(col("id")))

// Attach each group's max to every row, then keep only the rows below it
val df2 = df1
  .withColumn("max", maxOnWindow)
  .where(col("val") < col("max"))
  .select("id", "val")

In Java, the equivalent looks like this:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

// Max of "val" over a window partitioned by "id"
Column maxOnWindow = max(col("val")).over(Window.partitionBy("id"));

// Attach each group's max to every row, then keep only the rows below it
DataFrame df2 = df1
    .withColumn("max", maxOnWindow)
    .where(col("val").lt(col("max")))
    .select("id", "val");

Here is a good article on window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Answer 1 (score: 1):

Below is a Java implementation of Mario's Scala code; it mirrors the Java equivalent shown above:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

// Per-group max of "val", then drop the rows that equal it
Column maxOnWindow = max(col("val")).over(Window.partitionBy("id"));
DataFrame df2 = df1
    .withColumn("max", maxOnWindow)
    .where(col("val").lt(col("max")))
    .select("id", "val");

Answer 2 (score: 0):

Here is a way to achieve this using RDDs, in a more Scala-flavored style:

// Let's first get the data in key-value pair format
val data = sc.makeRDD( Seq( ("a",20), ("a", 1), ("a",8), ("b",3), ("b",10), ("b",9) ) )

// Next let's find the max value from each group
val maxGroups = data.reduceByKey( Math.max(_,_) )

// We join the max in the group with the original data
val combineMaxWithData = maxGroups.join(data)

// Finally we filter out the values that agree with the max
val finalResults = combineMaxWithData
  .filter { case (gid, (max, curVal)) => max != curVal }
  .map { case (gid, (max, curVal)) => (gid, curVal) }


println( finalResults.collect.toList )
>List((a,1), (a,8), (b,3), (b,9))
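
If the number of distinct ids is small, the join can be avoided altogether by collecting the per-group max to the driver and broadcasting it; a sketch of that variation (an alternative, not part of the original answer):

// Collect each group's max into a map and broadcast it to the executors
val maxByKey = sc.broadcast(data.reduceByKey(Math.max(_, _)).collectAsMap())

// Keep only the values that differ from their group's max
val finalResults2 = data.filter { case (gid, curVal) => curVal != maxByKey.value(gid) }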