I have this situation in Spark:
+-----+-----+-----+----------+-----------+-----------+
|month|years|id   |  category|sum(amount)|avg(amount)|
+-----+-----+-----+----------+-----------+-----------+
|    1| 2015| id_1|         A|      10000|       2000|
|    1| 2015| id_1|         B|       1000|        100|
|    1| 2015| id_1|         C|       2000|       1000|
+-----+-----+-----+----------+-----------+-----------+
and I would like to get this:
+-----------------+-----------------------+-----------------------+-----------------------+
|                 |      category_A       |      category_B       |      category_C       |
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+
|month|years|id   |sum(amount)|avg(amount)|sum(amount)|avg(amount)|sum(amount)|avg(amount)|
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+
|    1| 2015| id_1|      10000|       2000|       1000|        100|       2000|       1000|
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+
Is this possible?
Answer 0 (score: 0)
I found this solution using a DataFrame and pivot:
import org.apache.spark.sql.functions.{avg, sum}

// pivot on category, computing sum and avg of amount per (month, years, id)
val pivoted = df
  .groupBy($"month", $"years", $"id")
  .pivot("category")
  .agg(sum($"amount"), avg($"amount"))
Is a solution using RDDs possible?
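
For reference, here is one way such a reshaping could be sketched directly on an RDD, assuming the category values (A, B, C) are known up front; this is only an illustrative sketch under those assumptions, not a tested answer, and it reuses a SparkSession named spark like the one in the sketch above:

import org.apache.spark.sql.Row

// hypothetical case class for the raw records
case class Rec(month: Int, years: Int, id: String, category: String, amount: Double)

val categories = Seq("A", "B", "C")

// `spark` is an existing SparkSession; the rows are illustrative
val rdd = spark.sparkContext.parallelize(Seq(
  Rec(1, 2015, "id_1", "A", 10000.0),
  Rec(1, 2015, "id_1", "B", 1000.0),
  Rec(1, 2015, "id_1", "C", 2000.0)
))

val wideRows = rdd
  .map(r => ((r.month, r.years, r.id), (r.category, r.amount)))
  .aggregateByKey(Map.empty[String, (Double, Long)])(
    // fold each (category, amount) into a running (sum, count) per category
    (acc, v) => {
      val (s, c) = acc.getOrElse(v._1, (0.0, 0L))
      acc.updated(v._1, (s + v._2, c + 1))
    },
    // merge partial per-category maps coming from different partitions
    (m1, m2) => (m1.keySet ++ m2.keySet).map { k =>
      val (s1, c1) = m1.getOrElse(k, (0.0, 0L))
      val (s2, c2) = m2.getOrElse(k, (0.0, 0L))
      k -> (s1 + s2, c1 + c2)
    }.toMap
  )
  .map { case ((month, years, id), perCat) =>
    // emit sum(amount), avg(amount) for every known category, in a fixed order
    val stats = categories.flatMap { cat =>
      val (s, c) = perCat.getOrElse(cat, (0.0, 0L))
      Seq(s, if (c > 0) s / c else 0.0)
    }
    Row.fromSeq(Seq(month, years, id) ++ stats)
  }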