I have a table in the following format:
+-------+--------+
|Column1|Column2 |
+-------+--------+
|[A, 1] |X |
|[A, 1] |Y |
|[B, 2] |Y |
|[B, 2] |Z |
|[C, 1] |X |
|[C, 1] |Z |
+-------+--------+
I need a table with the following result:
+-------+-------+-------+-------+
| |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |[X, Y] |[Y] |[X] |
|[B, 2] |[Y] |[Y, Z] |[Z] |
|[C, 1] |[X] |[Z] |[X, Z] |
+-------+-------+-------+-------+
An even better result would be something like this:
+-------+-------+-------+-------+
| |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |2 |1 |1 |
|[B, 2] |1 |2 |1 |
|[C, 1] |1 |1 |2 |
+-------+-------+-------+-------+
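
For reference, a minimal sketch to build this sample data in PySpark. The schema here is an assumption: Column1 is created as a plain string like "[A, 1]"; adjust it if your column is actually a struct or array.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical setup: Column1 as a plain string; the real schema may differ.
df = spark.createDataFrame(
    [("[A, 1]", "X"), ("[A, 1]", "Y"),
     ("[B, 2]", "Y"), ("[B, 2]", "Z"),
     ("[C, 1]", "X"), ("[C, 1]", "Z")],
    ["Column1", "Column2"],
)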
Answer 0 (score: 2):
This will be quite expensive, especially for large data, but you can do a join + pivot:
from pyspark.sql.functions import count

# Self-join on Column2 to pair every two Column1 values that share a
# Column2 entry, then pivot the right-hand Column1 into columns and
# count the shared entries for each pair.
df.alias("l").join(df.alias("r"), on="Column2")\
    .select("l.Column1", "r.Column1")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(count("r.Column1"))\
    .show()
#+-------+------+------+------+
#|Column1|[A, 1]|[B, 2]|[C, 1]|
#+-------+------+------+------+
#| [A, 1]| 2| 1| 1|
#| [B, 2]| 1| 2| 1|
#| [C, 1]| 1| 1| 2|
#+-------+------+------+------+
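
For the first desired output (lists of the shared Column2 values rather than counts), the same join + pivot skeleton should work with collect_list swapped in for count. This is a sketch under that assumption, not part of the original answer:

from pyspark.sql.functions import collect_list

# Same self-join, but gather the shared Column2 values into a list
# for each pair instead of counting them.
df.alias("l").join(df.alias("r"), on="Column2")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(collect_list("Column2"))\
    .show()

The join cost is the same as above, so the caveat about large data still applies.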