Generating a complex PySpark table involving arrays and pivoting

Time: 2019-06-26 16:39:10

Tags: arrays pyspark pivot

I have a table in the following format:

+-------+--------+
|Column1|Column2 |
+-------+--------+
|[A, 1] |X       |
|[A, 1] |Y       |
|[B, 2] |Y       |
|[B, 2] |Z       |
|[C, 1] |X       |
|[C, 1] |Z       |
+-------+--------+
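
For reference, a minimal sketch that reproduces this input (an assumption: Column1 is an array<string> column, which the question does not state explicitly):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the table above; treating Column1 as
# array<string> is an assumption about the question's schema
df = spark.createDataFrame(
    [(["A", "1"], "X"), (["A", "1"], "Y"),
     (["B", "2"], "Y"), (["B", "2"], "Z"),
     (["C", "1"], "X"), (["C", "1"], "Z")],
    ["Column1", "Column2"],
)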

I need a table with the following result:

+-------+-------+-------+-------+
|       |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |[X, Y] |[Y]    |[X]    |
|[B, 2] |[Y]    |[Y, Z] |[Z]    |
|[C, 1] |[X]    |[Z]    |[X, Z] |
+-------+-------+-------+-------+

Or, even better, a result like this:

+-------+-------+-------+-------+
|       |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |2      |1      |1      |
|[B, 2] |1      |2      |1      |
|[C, 1] |1      |1      |2      |
+-------+-------+-------+-------+

1 answer:

Answer 0 (score: 2)

This will be quite expensive, especially on large data, but you can do it with a self-join followed by a pivot:

from pyspark.sql.functions import count

# Self-join on Column2 to pair every two Column1 values that share a
# Column2, then pivot the right-hand Column1 and count the matches
df.alias("l").join(df.alias("r"), on="Column2")\
    .select("l.Column1", "r.Column1")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(count("r.Column1"))\
    .show()
#+-------+------+------+------+
#|Column1|[A, 1]|[B, 2]|[C, 1]|
#+-------+------+------+------+
#| [A, 1]|     2|     1|     1|
#| [B, 2]|     1|     2|     1|
#| [C, 1]|     1|     1|     2|
#+-------+------+------+------+
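
If you prefer the first desired output (the shared Column2 values rather than their count), one sketch, not part of the original answer, is to aggregate with collect_set instead of count:

from pyspark.sql.functions import collect_set, sort_array

# Same self-join; with on="Column2" the join key is coalesced into a
# single column, so we can collect the shared values into a sorted array
df.alias("l").join(df.alias("r"), on="Column2")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(sort_array(collect_set("Column2")))\
    .show()

Given the same join logic as above, this should yield the arrays shown in the first desired table, e.g. [X, Y] in the [A, 1]/[A, 1] cell.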