Question

I have a dataset that I transformed using Spark Scala (1.6.2), and I ended up with the following two DataFrames.

DF1:

| date| country | count|
| 1872| Scotland|     1|
| 1873| England |     1|
| 1873| Scotland|     1|
| 1875| England |     1|
| 1875| Scotland|     2|

DF2:

| date| country | count|
| 1872| England |     1|
| 1873| Scotland|     1|
| 1874| England |     1|
| 1875| Scotland|     1|
| 1875| Wales   |     1|

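Now, from the two DataFrames above, I want to aggregate the counts by date and country, as in the output below. I tried using union and join but could not get the desired result.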

Expected output for the two DataFrames above:

| date| country | count|
| 1872| England |     1|
| 1872| Scotland|     1|
| 1873| Scotland|     2|
| 1873| England |     1|
| 1874| England |     1|
| 1875| Scotland|     3|
| 1875| Wales   |     1|
| 1875| England |     1|

Please help me solve this.

Answer 1

The best way is to perform a union, then a groupBy on the two key columns, and then use sum, where you can specify which column to add up:

df1.unionAll(df2)
   .groupBy("date", "country")
   .sum("count")

Output:

+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland|         1|
|1875| England|         1|
|1873| England|         1|
|1875|   Wales|         1|
|1872| England|         1|
|1874| England|         1|
|1873|Scotland|         2|
|1875|Scotland|         3|
+----+--------+----------+

Answer 2

Using the DataFrame API, you can achieve this with unionAll followed by groupBy.

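import org.apache.spark.sql.functions.sum
import sqlContext.implicits._  // enables $"count"; assumes a SQLContext named sqlContext (automatic in spark-shell)

DF1.unionAll(DF2)
  .groupBy("date", "country")
  .agg(sum($"count").as("count"))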

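This first puts all rows from both DataFrames into a single DataFrame. Then, by grouping on the date and country columns, you get the sum of the count column for each date and country, as requested. The as("count") part renames the aggregated column to count.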
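To make the union-then-group semantics concrete without a cluster, here is a minimal sketch of the same idea on plain Scala collections. It does not use Spark at all; the `Row` case class and the `mergeCounts` helper are hypothetical names introduced only for this illustration.

```scala
object UnionGroupSumSketch {
  // Hypothetical stand-in for one DataFrame row (not a Spark type).
  final case class Row(date: Int, country: String, count: Int)

  // Concatenate both inputs, group by (date, country), and sum the counts.
  def mergeCounts(a: Seq[Row], b: Seq[Row]): Seq[Row] =
    (a ++ b)
      .groupBy(r => (r.date, r.country))
      .map { case ((date, country), rows) => Row(date, country, rows.map(_.count).sum) }
      .toSeq
      .sortBy(r => (r.date, r.country))

  def main(args: Array[String]): Unit = {
    // Sample data copied from the question.
    val df1 = Seq(
      Row(1872, "Scotland", 1), Row(1873, "England", 1), Row(1873, "Scotland", 1),
      Row(1875, "England", 1), Row(1875, "Scotland", 2))
    val df2 = Seq(
      Row(1872, "England", 1), Row(1873, "Scotland", 1), Row(1874, "England", 1),
      Row(1875, "Scotland", 1), Row(1875, "Wales", 1))

    mergeCounts(df1, df2).foreach(println)
  }
}
```

Rows that appear in only one input pass through with their original count, and rows that share a (date, country) key are collapsed into one row with the summed count, which matches the expected output in the question.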

Note: in newer Spark versions (read: 2.0+), unionAll is deprecated and has been replaced by union.
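For Spark 2.0+, the same approach might look like the sketch below. It assumes a local `SparkSession` and rebuilds the question's sample data inline; the object name `UnionAggregate` and the helper `mergeCounts` are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object UnionAggregate {
  // Stack both DataFrames, then sum "count" per (date, country).
  def mergeCounts(a: DataFrame, b: DataFrame): DataFrame =
    a.union(b)
      .groupBy("date", "country")
      .agg(sum("count").as("count"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-aggregate")
      .getOrCreate()
    import spark.implicits._ // needed for .toDF on a Seq of tuples

    // Sample data copied from the question.
    val df1 = Seq(
      (1872, "Scotland", 1), (1873, "England", 1), (1873, "Scotland", 1),
      (1875, "England", 1), (1875, "Scotland", 2)
    ).toDF("date", "country", "count")
    val df2 = Seq(
      (1872, "England", 1), (1873, "Scotland", 1), (1874, "England", 1),
      (1875, "Scotland", 1), (1875, "Wales", 1)
    ).toDF("date", "country", "count")

    // orderBy is only there to make the printed result easier to read.
    mergeCounts(df1, df2).orderBy("date", "country").show()

    spark.stop()
  }
}
```

Keep in mind that union matches columns by position rather than by name, so both DataFrames need the same column order before they are combined.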

Merging and aggregating DataFrames using Spark Scala
