Suppose I have the following dataframes. How can I join the two so that the final output has a result column (value_2) where the number of records appended depends on the value of the rank column?
import pyspark.sql.functions as f
from pyspark.sql.window import Window
l = [(9, 1, 'A'),
     (9, 2, 'B'),
     (9, 3, 'C'),
     (9, 4, 'D'),
     (10, 1, 'A'),
     (10, 2, 'B')]
df = spark.createDataFrame(l, ['prod', 'rank', 'value'])
+----+----+-----+
|prod|rank|value|
+----+----+-----+
| 9| 1| A|
| 9| 2| B|
| 9| 3| C|
| 9| 4| D|
| 10| 1| A|
| 10| 2| B|
+----+----+-----+
sh = [(9, ['A', 'B', 'C', 'D']),
      (10, ['A', 'B'])]
sh = spark.createDataFrame(sh, ['prod', 'conc'])
+----+------------+
|prod|        conc|
+----+------------+
|   9|[A, B, C, D]|
|  10|      [A, B]|
+----+------------+
Final desired output:
+----+----+-----+-------+
|prod|rank|value|value_2|
+----+----+-----+-------+
|   9|   1|    A|      A|
|   9|   2|    B|    A,B|
|   9|   3|    C|  A,B,C|
|   9|   4|    D|A,B,C,D|
|  10|   1|    A|      A|
|  10|   2|    B|    A,B|
+----+----+-----+-------+
Answer (score: 2)
You can use a Window function and do this before the aggregation, without needing the second dataframe at all. In Spark 2.4+:
df.select(
    '*',
    # running collect_list: with an orderBy, the default window frame is
    # rows between unboundedPreceding and currentRow, so each row collects
    # the values up to and including its own rank
    f.array_join(
        f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')),
        ','
    ).alias('value_2')
).show()
+----+----+-----+-------+
|prod|rank|value|value_2|
+----+----+-----+-------+
| 9| 1| A| A|
| 9| 2| B| A,B|
| 9| 3| C| A,B,C|
| 9| 4| D|A,B,C,D|
| 10| 1| A| A|
| 10| 2| B| A,B|
+----+----+-----+-------+
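If you are on a Spark version earlier than 2.4 (no array_join), concat_ws also accepts an array column and joins its elements, so the same string result should be obtainable. A minimal sketch:

w = Window.partitionBy('prod').orderBy('rank')
df.select(
    '*',
    # concat_ws joins the elements of the collected array with ','
    f.concat_ws(',', f.collect_list(df.value).over(w)).alias('value_2')
).show()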
Or, if you don't need the array joined into a string:
df.select(
    '*',
    # same running collect_list, kept as an array column instead of a string
    f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')).alias('value_2')
).show()
+----+----+-----+------------+
|prod|rank|value| value_2|
+----+----+-----+------------+
| 9| 1| A| [A]|
| 9| 2| B| [A, B]|
| 9| 3| C| [A, B, C]|
| 9| 4| D|[A, B, C, D]|
| 10| 1| A| [A]|
| 10| 2| B| [A, B]|
+----+----+-----+------------+
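For completeness, since the question asked about a join with sh: a sketch of a join-based variant, assuming Spark 2.4+ (for array_join and the SQL slice function), is to join on prod and keep the first rank elements of conc. f.expr is used here because f.slice only accepts column start/length arguments from Spark 3.1 on:

# join df with the pre-aggregated sh, then truncate conc to the first `rank` elements
result = (df.join(sh, on='prod')
            .withColumn('value_2', f.array_join(f.expr('slice(conc, 1, rank)'), ','))
            .drop('conc'))
result.show()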