Creating separate columns from an array column in a Scala Spark DataFrame when the array is large

Asked: 2018-09-11 12:44:27

Tags: scala apache-spark

I have two columns: one of type Integer and one of type linalg.Vector. I can convert the linalg.Vector to an array, and each array has 32 elements. I want to turn each element of the array into its own column. So the input looks like:

column1                 column2
(3, 5, 25, ..., 12)           3
(2, 7, 15, ..., 10)           4
(1, 10, 12, ..., 35)          2

The output should be:

column1_1  column1_2  column1_3  .........  column1_32  column2
        3          5         25  .........          12        3
        2          7         15  .........          10        4
        1         10         12  .........          35        2

The catch is that in my case the array has 32 elements, so the approach from the question Convert Array of String column to multiple columns in spark scala would require writing out far too many columns by hand.

I have tried several approaches, but none of them worked. What is the right way to do this?

Thanks a lot.
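
For reference, one common way to do the Vector-to-array conversion mentioned above is a small UDF (a minimal sketch; it assumes column1 holds an org.apache.spark.ml.linalg.Vector and df is the input DataFrame):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Turn the ml Vector column into array<double> so individual elements can be indexed.
val vecToArray = udf((v: Vector) => v.toArray)
val dfWithArray = df.withColumn("column1", vecToArray(col("column1")))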

2 answers:

Answer 0 (score: 3):

scala> import org.apache.spark.sql.Column
scala> import org.apache.spark.sql.functions.col
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]

scala> def getColAtIndex(id: Int): Column = col("column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column

scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") // for the 32-element case, use (0 to 31) instead of (0 to 2)
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)

scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
|        3|        5|       25|      3|
|        2|        7|       15|      4|
|        1|       10|       12|      2|
+---------+---------+---------+-------+
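
For the 32-element case in the question, the same pattern scales by simply widening the index range (a sketch reusing getColAtIndex from above; nothing else changes):

val columns32: IndexedSeq[Column] = (0 to 31).map(getColAtIndex) :+ col("column2")
val wide = df.select(columns32: _*)   // yields column1_1 ... column1_32 plus column2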

Answer 1 (score: 1):

This is best done by writing a UserDefinedFunction like the following:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Returns the element at position idx of an ml Vector.
def getElementFromVector(vec: Vector, idx: Int): Double = vec(idx)

val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))

You can use it like this:

df.select(
    getElementFromVectorUDF($"column1", 0) as "column1_0",
    ...
    getElementFromVectorUDF($"column1", n) as "column1_n"
)
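
Rather than writing out all 32 projections by hand, the select list can also be generated (a sketch; it assumes column1 is a 32-element ml Vector and spark.implicits._ is in scope for the $ syntax):

// One column per vector slot, named column1_0 .. column1_31, plus column2.
val projected = (0 until 32).map(i => getElementFromVectorUDF($"column1", i).as(s"column1_$i")) :+ $"column2"
df.select(projected: _*)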

I hope this helps.