Is there a way to flatten a column that contains an array of arrays in a DataFrame, without using a UDF?
For example:
+---------------------------------------------------------------------------------------------------------+
|vector |
+---------------------------------------------------------------------------------------------------------+
|[[106.0,1006.0,26.0], [107.0,1007.0,27.0], [108.0,1008.0,28.0]] |
|[[100.0,1000.0,20.0]] |
|[[101.0,1001.0,21.0], [102.0,1002.0,22.0], [103.0,1003.0,23.0], [104.0,1004.0,24.0], [105.0,1005.0,25.0]]|
+---------------------------------------------------------------------------------------------------------+
It should be converted to:
+---------------------------------------------------------------------------------------------------------+
|vector |
+---------------------------------------------------------------------------------------------------------+
|[106.0,1006.0,26.0,107.0,1007.0,27.0,108.0,1008.0,28.0]                                                  |
|[100.0,1000.0,20.0]                                                                                      |
|[101.0,1001.0,21.0,102.0,1002.0,22.0,103.0,1003.0,23.0,104.0,1004.0,24.0,105.0,1005.0,25.0]              |
+---------------------------------------------------------------------------------------------------------+
Answer 0 (score: 1)
Here is one way to do it using the rdd:
from functools import reduce  # reduce is not a builtin in Python 3
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("A", [[106.0, 1006.0, 26.0], [107.0, 1007.0, 27.0], [108.0, 1008.0, 28.0]])
    ],
    ("Col1", "Col2")
)

# Concatenate the inner lists of each row with operator.add
df.rdd.map(lambda row: (row['Col1'], reduce(add, row['Col2'])))\
    .toDF(['Col1', 'Col2'])\
    .show(truncate=False)
#+----+---------------------------------------------------------------+
#|Col1|Col2 |
#+----+---------------------------------------------------------------+
#|A |[106.0, 1006.0, 26.0, 107.0, 1007.0, 27.0, 108.0, 1008.0, 28.0]|
#+----+---------------------------------------------------------------+
However, serializing to an rdd is expensive in terms of performance. Personally, I would recommend using a udf for this task instead, as sketched below. As far as I know, there is no way to do this with only Spark DataFrame functions and no udf.
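For illustration, a minimal sketch of such a udf, assuming the df defined above (the name flatten_udf is mine, not from the original answer):

from functools import reduce
from operator import add
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Hypothetical udf: concatenate the inner lists into one flat list
flatten_udf = F.udf(lambda arrs: reduce(add, arrs), ArrayType(DoubleType()))

df.withColumn("Col2", flatten_udf("Col2")).show(truncate=False)

This produces the same flattened result as the rdd version while staying inside a DataFrame transformation.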
Answer 1 (score: 0)
You can use the flatten function described in the official documentation; it was introduced in Spark 2.4, as sketched below. Take a look at this equivalent duplicate question.
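A minimal sketch of that approach, assuming Spark 2.4+ and the example df from the first answer:

from pyspark.sql import functions as F

# flatten (Spark 2.4+) collapses an array of arrays into a single array
df.select("Col1", F.flatten("Col2").alias("Col2")).show(truncate=False)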