Return a row with the best fields in a pyspark GroupedData

Date: 2018-08-14 15:03:25

Tags: python apache-spark dataframe pyspark

I'm trying to aggregate a pyspark GroupedData object into a Row with the best attribute per column (not None, or the one with the highest timestamp), e.g. given this Dataframe:

╔═══════╦═══════════╦════════╦════════╦════════╗
║ group ║ timestamp ║ value1 ║ value2 ║ value3 ║
╠═══════╬═══════════╬════════╬════════╬════════╣
║ a     ║ 111       ║ None   ║ None   ║ None   ║
║ a     ║ 222       ║ a      ║ None   ║ None   ║
║ a     ║ 333       ║ b      ║ 1      ║ 1.1    ║
║ a     ║ 444       ║ None   ║ None   ║ 2.2    ║
║ b     ║ 111       ║ c      ║ None   ║ 3.3    ║
╚═══════╩═══════════╩════════╩════════╩════════╝

I want a result Dataframe like this:

╔═══════╦═══════════╦════════╦════════╦════════╗
║ group ║ timestamp ║ value1 ║ value2 ║ value3 ║
╠═══════╬═══════════╬════════╬════════╬════════╣
║ a     ║ 444       ║ b      ║ 1      ║ 2.2    ║
║ b     ║ 111       ║ c      ║ None   ║ 3.3    ║
╚═══════╩═══════════╩════════╩════════╩════════╝

Ideally, I'd like to apply different logic to aggregate each column, for example min for timestamp and max for value3.

Is this possible with pyspark GroupedData?
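
In other words, something like this sketch of per-column aggregation (the dict form of agg here is purely illustrative):

# illustrative sketch: a different aggregate function per column,
# using the dict form of GroupedData.agg
df.groupBy('group').agg({'timestamp': 'min', 'value3': 'max'})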

Thanks

2 answers:

Answer 0 (score: 2)

Spark SQL will actually do exactly what you want here: null values are ignored when a column is aggregated. For example, consider the following dataframe:

df = sc.parallelize([("a", 1, None), ("b", None, 5), ("a", 2, None), ("b", 0, 7)]).toDF(["A", "B", "C"])

which looks like this:

+---+----+----+
|  A|   B|   C|
+---+----+----+
|  a|   1|null|
|  b|null|   5|
|  a|   2|null|
|  b|   0|   7|
+---+----+----+

You can aggregate it using a different function per column:

import pyspark.sql.functions as F
df.groupBy("A").agg(F.min(F.col("B")), F.max(F.col("C"))).show()

and get the desired result (null values are ignored unless they are the only values in the group, and each column can use its own aggregate function):

+---+------+------+
|  A|min(B)|max(C)|
+---+------+------+
|  b|     0|     7|
|  a|     1|  null|
+---+------+------+
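
If you need friendlier column names than the generated min(B) and max(C), the same aggregates can be aliased (a small illustrative variation):

# same aggregation, with explicit names for the result columns
df.groupBy("A").agg(F.min("B").alias("min_b"), F.max("C").alias("max_c")).show()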

Answer 1 (score: 1)

You can follow the steps below to get the result you want:

# create data frame like below to match your grouped data frame
df = sqlContext.createDataFrame(
    [('a', 111, None, None, None),
     ('a', 222, 'a', None, None),
     ('a', 333, 'b', 1, 1.1),
     ('a', 444, None, None, 2.2),
     ('b', 111, 'c', None, 3.3)],
    ('group', 'timestamp', 'value1', 'value2', 'value3'))

# import necessary functions 
import pyspark.sql.functions as f

# apply group by and agg functions on the data frame
df1 = df.groupBy('group').agg(
    f.min('timestamp').alias('timestamp'),
    f.max('value1').alias('value1'),
    f.max('value2').alias('value2'),
    f.max('value3').alias('value3'))

# show the result data frame
df1.show()

# +-----+---------+------+------+------+
# |group|timestamp|value1|value2|value3|
# +-----+---------+------+------+------+
# |    a|      111|     b|     1|   2.2|
# |    b|      111|     c|  null|   3.3|
# +-----+---------+------+------+------+
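
Note that f.max('value1') picks the largest value1 by sort order, not necessarily the value from the row with the latest timestamp. If "best" should mean "the non-null value with the highest timestamp", one possible sketch (same df as above; it relies on Spark comparing structs field by field, with nulls sorting lowest) is:

# sketch: for each value column, keep the value from the row with the
# highest timestamp among rows where that value is not null
df2 = df.groupBy('group').agg(
    f.max('timestamp').alias('timestamp'),
    *[f.max(f.struct(
          f.when(f.col(c).isNotNull(), f.col('timestamp')).alias('ts'),
          f.col(c).alias('val')
      ))['val'].alias(c)
      for c in ['value1', 'value2', 'value3']])
df2.show()

For the sample data this should keep timestamp 444 (instead of 111) for group a, matching the result table in the question.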