Question

I have the following columns in a DataFrame df:

c_id    p_id     type  values
278230  57371100 11    1
278230  57371100 12    1
...

I run the following code, expecting to see columns named 11_total and 12_total:

df
 .groupBy($"c_id",$"p_id")
 .pivot("type")
 .agg(sum("values") as "total")
 .na.fill(0)
 .show()

Instead, I get columns named 11 and 12:

+-----------+----------+---+---+
|       c_id|      p_id| 11| 12|
+-----------+----------+---+---+
|     278230|  57371100|  0|  1|
|     337790|  72031970|  3|  0|
|     320710|  71904400|  0|  1|

Why?

Answer 1

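Spark appends the aggregation alias to the pivot column names only when there is more than one aggregation, where the suffix is needed to tell the resulting columns apart. With a single aggregation, each column is named after the pivot value alone: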

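// Outside spark-shell these imports are needed (spark is the active SparkSession).
import spark.implicits._
import org.apache.spark.sql.functions.{max, sum}

val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1),
  (337790, 72031970, 11, 2),
  (337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")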

df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total")).
  show
// +------+--------+---+---+
// |  c_id|    p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970|  3|  3|
// |278230|57371100|  1|  2|
// +------+--------+---+---+

df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total"), max("values").as("max")).
  show
// +------+--------+--------+------+--------+------+
// |  c_id|    p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970|       3|     2|       3|     3|
// |278230|57371100|       1|     1|       2|     2|
// +------+--------+--------+------+--------+------+
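If the 11_total and 12_total names are wanted with a single aggregation, one option is to rename the pivot columns after the fact. The following is a minimal sketch, not something the answer above prescribes: the object name PivotRename and the helper withSuffix are hypothetical, and it assumes a local SparkSession and that every non-key column comes from the pivot.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object PivotRename {
  // Hypothetical helper: append "_<suffix>" to every column name that is not a grouping key.
  def withSuffix(columns: Seq[String], keys: Set[String], suffix: String): Seq[String] =
    columns.map(c => if (keys.contains(c)) c else s"${c}_$suffix")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for this illustration.
    val spark = SparkSession.builder().master("local[*]").appName("pivot-rename").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (278230, 57371100, 11, 1),
      (278230, 57371100, 12, 2),
      (337790, 72031970, 11, 1),
      (337790, 72031970, 11, 2),
      (337790, 72031970, 12, 3)
    ).toDF("c_id", "p_id", "type", "values")

    val keys = Seq("c_id", "p_id")

    // Single aggregation: the pivot columns are named after the pivot values ("11", "12").
    val pivoted = df.groupBy(keys.map(col): _*).pivot("type").agg(sum("values"))

    // Rename all columns at once; the grouping keys keep their names.
    val renamed = pivoted.toDF(withSuffix(pivoted.columns.toSeq, keys.toSet, "total"): _*)

    // Columns are now c_id, p_id, 11_total, 12_total.
    renamed.na.fill(0).show()

    spark.stop()
  }
}
```

Renaming through toDF avoids resolving columns whose names are purely numeric, at the cost of relying on the column order returned by the pivot.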
