Question

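I have a DataFrame with a start_date column of date type. For each start_date, I need to compute the number of unique values in column1 whose start_date is on or before that date. Here is the input DataFrame: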

column1   column2  start_date
id1       val1     2018-03-12
id1       val2     2018-03-12
id2       val3     2018-03-12 
id3       val4     2018-03-12
id4       val5     2018-03-11
id4       val6     2018-03-11
id5       val7     2018-03-11
id5       val8     2018-03-11 
id6       val9     2018-03-10

Now I have to transform it into the following:

start_date     count
2018-03-12    6
2018-03-11    3
2018-03-10    1

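This is what I am doing now, and it is not an efficient approach:

Find all distinct start_dates and store them as a list.
Loop over the list and generate the output for each start_date.
Combine all the outputs into a single DataFrame.

Is there a better way to do this without looping?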

Answer 1

Try something like the following:

df.groupBy("start_date").agg(countDistinct("column1"))

Explore along these lines:

Check countDistinct - https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions

Use a Spark Window - Example
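To make the groupBy / countDistinct suggestion concrete, here is a minimal sketch that runs it against the sample data from the question. The object name, the helper names, and the local SparkSession settings are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, to_date}

object PerDateDistinct {
  // Rebuilds the question's sample input; start_date is parsed into a date column.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("id1", "val1", "2018-03-12"),
      ("id1", "val2", "2018-03-12"),
      ("id2", "val3", "2018-03-12"),
      ("id3", "val4", "2018-03-12"),
      ("id4", "val5", "2018-03-11"),
      ("id4", "val6", "2018-03-11"),
      ("id5", "val7", "2018-03-11"),
      ("id5", "val8", "2018-03-11"),
      ("id6", "val9", "2018-03-10")
    ).toDF("column1", "column2", "start_date")
      .withColumn("start_date", to_date(col("start_date")))
  }

  // Distinct column1 values per start_date (per date only, not cumulative).
  def perDateDistinct(df: DataFrame): DataFrame =
    df.groupBy("start_date").agg(countDistinct("column1").alias("count"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-date-distinct")
      .getOrCreate()
    perDateDistinct(sampleData(spark)).orderBy(col("start_date").desc).show()
    spark.stop()
  }
}
```

On the sample data this gives 3, 2 and 1 for 2018-03-12, 2018-03-11 and 2018-03-10. These are per-date counts, so a running total over the dates (for example with a window function) is still needed to reach the 6, 3, 1 the question asks for.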

Answer 2

You can combine a standard aggregation with a window function, but the second stage will not be distributed:

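import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// The $"..." column syntax also needs: import spark.implicits._ (spark being your SparkSession)

df
  // Approximate number of distinct column1 values per start_date
  .groupBy($"start_date")
  .agg(approx_count_distinct($"column1").alias("count"))
  // Running total over the dates; an unpartitioned window is evaluated in a single partition
  .withColumn(
    "cumulative_count", sum($"count").over(Window.orderBy($"start_date")))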

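Summing per-date distinct counts equals the cumulative number of unique column1 values only when each value appears on a single date, as in the sample data, and approx_count_distinct returns an estimate. If values can repeat across dates, one possible variant is to count each value only on the first date it appears and then take a running sum. The sketch below is one way to do that under those assumptions; `CumulativeDistinct`, `cumulativeDistinct`, `first_seen` and `new_values` are hypothetical names, not something from the original answers.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, min, sum}

object CumulativeDistinct {
  // Hypothetical helper: for each start_date, the number of distinct column1
  // values seen on or before that date.
  def cumulativeDistinct(df: DataFrame): DataFrame = {
    // Earliest date on which each column1 value appears.
    val firstSeen = df.groupBy("column1").agg(min("start_date").alias("first_seen"))

    // Number of values that appear for the first time on each date.
    val newPerDate = firstSeen
      .groupBy(col("first_seen").alias("start_date"))
      .agg(count("column1").alias("new_values"))

    // Keep every date from the input, including dates that add no new value.
    val allDates = df.select("start_date").distinct()
    val perDate = allDates
      .join(newPerDate, Seq("start_date"), "left")
      .na.fill(0L, Seq("new_values"))

    // Running sum over dates. The window has no partitionBy, so this step runs
    // in a single partition; by then there is only one row per date.
    perDate
      .withColumn("count", sum("new_values").over(Window.orderBy("start_date")))
      .select("start_date", "count")
  }
}
```

Calling `CumulativeDistinct.cumulativeDistinct(df)` on the question's input should give 1, 3 and 6 for 2018-03-10, 2018-03-11 and 2018-03-12. The two groupBy stages are distributed; only the final window over the per-date rows is not.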