For a Spark DataFrame df, I have (col1, col2, col3) for a groupBy operation, i.e. df.groupBy(col1, col2, col3), and df has other columns such as col4, col5, etc. How do I get the maximum value of each of those other columns (col4, col5, ...) within each group of (col1, col2, col3)? I expect to do something like:
df.groupBy(col1, col2, col3).max(...)
The result should look like:
+----+----+--------+--------+
|col1|col2|col3_max|col4_max|
+----+----+--------+--------+
|1021|   a|       .|       .|
|1000|   b|       .|       .|
|1011|   c|       .|       .|
+----+----+--------+--------+
Answer 0 (score: 0)
What you have should work. I just tried it by opening spark-shell:
scala> val df = Seq((1,2,3,4,5,6), (1,2,3,9,3,2), (2,3,3,1,1,1)).toDF("col1", "col2", "col3", "col4", "col5", "col6")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 4 more fields]
scala> df.groupBy("col1", "col2", "col3").max("col4", "col5", "col6").show()
+----+----+----+---------+---------+---------+
|col1|col2|col3|max(col4)|max(col5)|max(col6)|
+----+----+----+---------+---------+---------+
| 2| 3| 3| 1| 1| 1|
| 1| 2| 3| 9| 5| 6|
+----+----+----+---------+---------+---------+
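If you also want the output columns named like col4_max in your expected result, one option is agg with aliased aggregates instead of max(...). A minimal sketch, assuming the same df as above (column names are illustrative):

scala> import org.apache.spark.sql.functions.max
scala> df.groupBy("col1", "col2", "col3").agg(max("col4").as("col4_max"), max("col5").as("col5_max")).show()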
If you would rather not name every column, you can do the following, but then you need to filter out the max values of col1, col2, and col3 (one way to drop them is sketched after the output):
scala> df.groupBy("col1", "col2", "col3").max().show()
+----+----+----+---------+---------+---------+---------+---------+---------+
|col1|col2|col3|max(col1)|max(col2)|max(col3)|max(col4)|max(col5)|max(col6)|
+----+----+----+---------+---------+---------+---------+---------+---------+
| 2| 3| 3| 2| 3| 3| 1| 1| 1|
| 1| 2| 3| 1| 2| 3| 9| 5| 6|
+----+----+----+---------+---------+---------+---------+---------+---------+
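One way to do that filtering is to drop the redundant columns after the aggregation, since the max of a grouping column is always equal to the key itself. A minimal sketch using DataFrame.drop; note that drop takes the literal result column names, parentheses included:

scala> df.groupBy("col1", "col2", "col3").max().drop("max(col1)", "max(col2)", "max(col3)").show()
+----+----+----+---------+---------+---------+
|col1|col2|col3|max(col4)|max(col5)|max(col6)|
+----+----+----+---------+---------+---------+
|   2|   3|   3|        1|        1|        1|
|   1|   2|   3|        9|        5|        6|
+----+----+----+---------+---------+---------+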