Spark: transpose multiple rows into a single row with more columns

Date: 2017-01-31 15:06:23

Tags: apache-spark

I have this situation in Spark:

+-----+-----+-----+----------+-----------+-----------+
|month|years|id   |  category|sum(amount)|avg(amount)|
+-----+-----+-----+----------+-----------+-----------+
|  1  | 2015| id_1|     A    |   10000   |    2000   |
|  1  | 2015| id_1|     B    |   1000    |    100    |
|  1  | 2015| id_1|     C    |   2000    |    1000   |
+-----+-----+-----+----------+-----------+-----------+

and I would like to get this:

+-----------------+-----------------------+-----------------------+-----------------------+
|                 |      category_A       |      category_B       |      category_C       |
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+
|month|years|id   |sum(amount)|avg(amount)|sum(amount)|avg(amount)|sum(amount)|avg(amount)|
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+
|  1  | 2015| id_1|  10000    |    2000   |   1000    |    100    |   2000    |    1000   |
+-----+-----+-----+-----------+-----------+-----------+-----------+-----------+-----------+

Is this possible?

1 Answer:

Answer 0 (score: 0)

I found this solution using DataFrame and pivot:

// assumes a SparkSession in scope named `spark`, for the $"..." column syntax
import spark.implicits._
import org.apache.spark.sql.functions.{avg, sum}

df
  .groupBy($"month", $"years", $"id")
  .pivot("category")
  .agg(sum($"amount"), avg($"amount"))

Is a solution using RDDs also possible?
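
For completeness, one way such an RDD solution could look (my own sketch, not from the original answer; Rec and pivotViaRdd are hypothetical names, and the set of categories must be supplied up front, which is exactly the bookkeeping pivot does for you):

import org.apache.spark.rdd.RDD

// Hypothetical record type for the raw rows.
case class Rec(month: Int, years: Int, id: String, category: String, amount: Double)

// Returns one row per (month, years, id) with a (sum, avg) pair per category,
// in the fixed order given by `categories`.
def pivotViaRdd(rdd: RDD[Rec], categories: Seq[String]): RDD[(Int, Int, String, Seq[(Double, Double)])] =
  rdd
    .map(r => ((r.month, r.years, r.id, r.category), (r.amount, 1L)))
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // sum and count per category
    .map { case ((m, y, id, cat), (s, n)) => ((m, y, id), Map(cat -> (s, s / n))) }
    .reduceByKey(_ ++ _)                                              // merge the one-entry category maps
    .map { case ((m, y, id), byCat) =>
      (m, y, id, categories.map(c => byCat.getOrElse(c, (0.0, 0.0)))) // fixed column order, 0.0 for missing
    }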