I have the following DataFrame df:

url           user  date                 followers
www.test1.com A     2017-01-04 05:46:00  45
www.test1.com B     2017-01-03 10:46:00  10
www.test1.com C     2017-01-05 05:46:00  11
www.test2.com B     2017-01-03 17:00:00  10
www.test2.com A     2017-01-04 15:05:00  45
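
A minimal sketch to reproduce this DataFrame, assuming a spark-shell session where a SparkSession named spark is in scope and date is parsed as a timestamp:

import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = Seq(
  ("www.test1.com", "A", "2017-01-04 05:46:00", 45),
  ("www.test1.com", "B", "2017-01-03 10:46:00", 10),
  ("www.test1.com", "C", "2017-01-05 05:46:00", 11),
  ("www.test2.com", "B", "2017-01-03 17:00:00", 10),
  ("www.test2.com", "A", "2017-01-04 15:05:00", 45)
).toDF("url", "user", "date", "followers")
  .withColumn("date", to_timestamp($"date"))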
For each distinct url, I need to find the sum of followers, the user with the earliest date, the number of distinct users, and the earliest and latest date.

This is what I have done so far:
val wFirstUser = Window.partitionBy($"url", $"user").orderBy($"date".asc)
val result = df
  .groupBy("url")
  .agg(sum("followers"), countDistinct("user"), min("date"), max("date"))
  .withColumn("rn", row_number.over(wFirstUser)).where($"rn" === 1).drop("rn")
Expected output:

url           first_user earliest_date       latest_date         sum_followers distinct_users
www.test1.com B          2017-01-03 10:46:00 2017-01-05 05:46:00 66            3
www.test2.com B          2017-01-03 17:00:00 2017-01-04 15:05:00 55            2
But I can't find the user with the earliest date (i.e. first_user). Can anybody help me?
Answer 0: (score: 1)
You don't need a window function. You just need to create a struct column of (date, user) and take its minimum: Spark compares structs field by field, so the minimum struct carries the earliest date together with its corresponding user. The rest of the aggregation is the same as yours:
import org.apache.spark.sql.functions._

// pack date and user together so min() keeps the pair intact;
// structs compare field by field (date first), so min picks the earliest (date, user)
val result = df.withColumn("struct", struct("date", "user"))
  .groupBy("url")
  .agg(sum("followers").as("sum_followers"), countDistinct("user").as("distinct_users"),
       max("date").as("latest_date"), min("struct").as("struct"))
  .select(col("url"), col("struct.user").as("first_user"), col("struct.date").as("earliest_date"),
          col("latest_date"), col("sum_followers"), col("distinct_users"))
which should give you:
+-------------+----------+-------------------+-------------------+-------------+--------------+
|url |first_user|earliest_date |latest_date |sum_followers|distinct_users|
+-------------+----------+-------------------+-------------------+-------------+--------------+
|www.test1.com|B |2017-01-03 10:46:00|2017-01-05 05:46:00|66.0 |3 |
|www.test2.com|B |2017-01-03 17:00:00|2017-01-04 15:05:00|55.0 |2 |
+-------------+----------+-------------------+-------------------+-------------+--------------+
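
If you'd rather keep the window-function route from the question, a sketch under the same assumptions also works: partition by url only (not by url and user), take the first row per url to get first_user, and join it back onto plain groupBy aggregates:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named `spark`

// earliest row per url gives the first user
val w = Window.partitionBy($"url").orderBy($"date".asc)
val firstUsers = df.withColumn("rn", row_number.over(w))
  .where($"rn" === 1)
  .select($"url", $"user".as("first_user"))

// plain per-url aggregates
val aggs = df.groupBy("url")
  .agg(sum("followers").as("sum_followers"), countDistinct("user").as("distinct_users"),
       min("date").as("earliest_date"), max("date").as("latest_date"))

val result = aggs.join(firstUsers, "url")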