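For a given PySpark DataFrame, what is the best way to combine several columns whose contents are lists into a single new column whose content is a list of lists?
Example input: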
id_1|id_2|id_3|timestamp             |thing1       |thing2       |thing3
A   |b   |c   |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6]|[2.5,3.4,2.9]
A   |b   |d   |[time_0,time_1]       |[5.1,6.1]    |[5.5,6.2]    |[5.7,6.3]
A   |b   |e   |[time_0,time_1]       |[0.1,0.2]    |[0.5,0.3]    |[0.9,0.6]
Example output:
id_1|id_2|id_3|timestamp             |agg_things
A   |b   |c   |[time_0,time_1,time_2]|[[1.2,1.1,2.2],[1.3,1.5,2.6],[2.5,3.4,2.9]]
A   |b   |d   |[time_0,time_1]       |[[5.1,6.1],[5.5,6.2],[5.7,6.3]]
A   |b   |e   |[time_0,time_1]       |[[0.1,0.2],[0.5,0.3],[0.9,0.6]]
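Answer 0 (score: 0)
I found a simple way to do this, using array from pyspark.sql.functions to combine the three columns into one:
from pyspark.sql.functions import array, col
example_df.withColumn('agg_things', array(col("thing1"), col("thing2"), col("thing3")))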