Question

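I have a column of lists in a Spark dataframe.

How can I convert the array into a Spark dataframe in which each element of the list becomes its own column?

I am new to Scala, and I would like to solve this in Scala.

For example: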

Answer 1

You can do this by creating an RDD of Rows, defining a schema, and using that schema to convert the RDD into a dataframe.

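import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// `sc` and `spark` are the SparkContext and SparkSession that spark-shell provides
// A seq of seqs
val s = Seq(1 to 6, 1 to 6, 1 to 6)
// Let's create a RDD of Rows
val rdd = sc.parallelize(s).map(Row.fromSeq)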

// Let's define a schema based on the first seq of s
val schema = StructType(
    (1 to s(0).size).map(i => StructField("c"+i, IntegerType, true))
)
// And let's finally create the dataframe
val df = spark.createDataFrame(rdd, schema)
df.show

// +---+---+---+---+---+---+
// | c1| c2| c3| c4| c5| c6|
// +---+---+---+---+---+---+
// |  1|  2|  3|  4|  5|  6|
// |  1|  2|  3|  4|  5|  6|
// |  1|  2|  3|  4|  5|  6|
// +---+---+---+---+---+---+
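
The schema above is derived from the first inner sequence only, so it assumes every row has the same number of elements. A minimal sketch of the same approach with an explicit length check follows. The local SparkSession and the names `width` and `seq-of-seqs` are assumptions added for illustration, so that the snippet also runs outside spark-shell.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: a local session stands in for the `spark` / `sc` values of spark-shell
val spark = SparkSession.builder().master("local[*]").appName("seq-of-seqs").getOrCreate()

val s: Seq[Seq[Int]] = Seq(1 to 6, 1 to 6, 1 to 6)

// Fail early if any inner sequence differs in length from the first one
val width = s.head.size
require(s.forall(_.size == width), s"all rows must have $width elements")

// One nullable integer column per position: c1, c2, ..., c<width>
val schema = StructType((1 to width).map(i => StructField(s"c$i", IntegerType, nullable = true)))

// Turn each inner sequence into a Row, then attach the schema
val rdd = spark.sparkContext.parallelize(s).map(r => Row.fromSeq(r))
val df = spark.createDataFrame(rdd, schema)
df.show()
```

The `require` call is only one way to handle ragged input; padding short rows with nulls would be another option, depending on the data.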

Answer 2

If you have a dataframe like the one mentioned in the question, with an array column such as

root
 |-- features: array (nullable = true)
 |    |-- element: integer (containsNull = false)

then you can use the following logic:

val finalCols = Array("c1", "c2", "c3", "c4", "c5", "c6", "c7")

import org.apache.spark.sql.functions._
// Add one column per array position, then keep only the new columns
finalCols.zipWithIndex
  .foldLeft(df) { (tempdf, c) => tempdf.withColumn(c._1, col("features")(c._2)) }
  .select(finalCols.map(col): _*)
  .show(false)

That should give you

+---+---+---+---+---+---+---+
|c1 |c2 |c3 |c4 |c5 |c6 |c7 |
+---+---+---+---+---+---+---+
|0  |45 |63 |0  |0  |0  |0  |
|0  |0  |0  |85 |0  |69 |0  |
|0  |89 |56 |0  |0  |0  |0  |
+---+---+---+---+---+---+---+
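
For anyone who wants to reproduce the table above, here is a self-contained sketch. It builds a `features` dataframe from the rows shown in that output and extracts the columns in a single `select`, rather than with repeated `withColumn` calls. The local SparkSession and the names `columns` and `wide` are assumptions added for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session stands in for the `spark` value of spark-shell
val spark = SparkSession.builder().master("local[*]").appName("array-to-columns").getOrCreate()
import spark.implicits._

// Sample rows copied from the output table above
val df = Seq(
  Seq(0, 45, 63, 0, 0, 0, 0),
  Seq(0, 0, 0, 85, 0, 69, 0),
  Seq(0, 89, 56, 0, 0, 0, 0)
).toDF("features")

val finalCols = Array("c1", "c2", "c3", "c4", "c5", "c6", "c7")

// features(i) picks the i-th array element; alias it to the i-th target name
val columns = finalCols.zipWithIndex.map { case (name, i) => col("features")(i).as(name) }

val wide = df.select(columns: _*)
wide.show(false)
```

Both variants express the same projection; the single `select` simply avoids creating an intermediate dataframe for each added column.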

Alternatively, you can use a udf function:

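import org.apache.spark.sql.functions._
// Assumed definition (not given in the answer): seven Int fields named after the output columns
case class testCaseClass(c1: Int, c2: Int, c3: Int, c4: Int, c5: Int, c6: Int, c7: Int)
def splitArrayUdf = udf((features: Seq[Int]) => testCaseClass(features(0), features(1), features(2), features(3), features(4), features(5), features(6)))

// The udf returns a struct; "features.*" expands its fields into separate columns
df.select(splitArrayUdf(col("features")).as("features")).select(col("features.*")).show(false)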

That should give you the same result.

I hope the answer is helpful.

Converting a list of lists to a dataframe
