How do I find the earliest and latest date in a DataFrame within a grouping operation?

Asked: 2018-05-09 13:18:15

Tags: scala apache-spark apache-spark-sql

I have the following DataFrame df:

url            user    date                  followers
www.test1.com  A       2017-01-04 05:46:00   45
www.test1.com  B       2017-01-03 10:46:00   10
www.test1.com  C       2017-01-05 05:46:00   11
www.test2.com  B       2017-01-03 17:00:00   10
www.test2.com  A       2017-01-04 15:05:00   45
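
For reference, here is a minimal sketch to reproduce this df locally. The session setup is an assumption (any SparkSession works); followers is assumed to be a Double, matching the 66.0 / 55.0 sums printed in the answer's output, and date is kept as a "yyyy-MM-dd HH:mm:ss" string, which sorts chronologically:

import org.apache.spark.sql.SparkSession

// Assumed local session for experimentation
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// Sample rows matching the table above
val df = Seq(
  ("www.test1.com", "A", "2017-01-04 05:46:00", 45.0),
  ("www.test1.com", "B", "2017-01-03 10:46:00", 10.0),
  ("www.test1.com", "C", "2017-01-05 05:46:00", 11.0),
  ("www.test2.com", "B", "2017-01-03 17:00:00", 10.0),
  ("www.test2.com", "A", "2017-01-04 15:05:00", 45.0)
).toDF("url", "user", "date", "followers")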

For each distinct url, I need to find the sum of followers, the count of distinct user values, the user with the earliest date, the earliest date and the latest date.

This is what I have so far:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wFirstUser = Window.partitionBy($"url", $"user").orderBy($"date".asc)
val result = df
                .groupBy("url")
                .agg(sum("followers"), countDistinct("user"), min("date"), max("date"))
                .withColumn("rn", row_number.over(wFirstUser)).where($"rn" === 1).drop("rn")

Expected output:

url            first_user    earliest_date         latest_date           sum_followers   distinct_users
www.test1.com  B             2017-01-03 10:46:00   2017-01-05 05:46:00   66              3
www.test2.com  B             2017-01-03 17:00:00   2017-01-04 15:05:00   55              2

But I can't work out how to get the user associated with the earliest date (i.e. first_user). Can someone help me?

1 Answer:

Answer 0 (score: 1)

You don't need a window function. You just need to create a struct column from date and user: Spark orders struct values field by field, left to right, so taking the min of that struct yields the earliest date together with its corresponding user. The rest of the aggregations stay as you already have them:

import org.apache.spark.sql.functions._

// Pack date and user together; min on a struct compares fields left to right,
// so the row with the smallest date (and its user) wins
val result = df.withColumn("struct", struct("date", "user"))
  .groupBy("url")
  .agg(
    sum("followers").as("sum_followers"),
    countDistinct("user").as("distinct_users"),
    max("date").as("latest_date"),
    min("struct").as("struct"))
  .select(
    col("url"),
    col("struct.user").as("first_user"),
    col("struct.date").as("earliest_date"),
    col("latest_date"), col("sum_followers"), col("distinct_users"))

which should give you:

+-------------+----------+-------------------+-------------------+-------------+--------------+
|url          |first_user|earliest_date      |latest_date        |sum_followers|distinct_users|
+-------------+----------+-------------------+-------------------+-------------+--------------+
|www.test1.com|B         |2017-01-03 10:46:00|2017-01-05 05:46:00|66.0         |3             |
|www.test2.com|B         |2017-01-03 17:00:00|2017-01-04 15:05:00|55.0         |2             |
+-------------+----------+-------------------+-------------------+-------------+--------------+
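
As a side note on the original attempt: the window there was defined over both url and user and applied after the groupBy, where the user column no longer exists. A hedged sketch of a working window-based variant (assuming the df and implicits from the setup above) partitions by url only, ranks before aggregating, and joins the two results; it costs an extra shuffle compared to the struct trick:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each url by ascending date; rn = 1 is the earliest row
val w = Window.partitionBy($"url").orderBy($"date".asc)
val firstUsers = df.withColumn("rn", row_number.over(w))
  .where($"rn" === 1)
  .select($"url", $"user".as("first_user"))

// Plain per-url aggregates, as in the question
val aggregates = df.groupBy("url").agg(
  sum("followers").as("sum_followers"),
  countDistinct("user").as("distinct_users"),
  min("date").as("earliest_date"),
  max("date").as("latest_date"))

// Combine the two halves on url
val windowResult = aggregates.join(firstUsers, Seq("url"))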