Convert an Array of Array column to an Array column in Scala / Spark

Date: 2018-09-08 16:04:34

Tags: scala apache-spark dataframe

I have a DataFrame that looks like this:

+---+-----+--------------------------------------------------------------------------------------------------+------+
|uid|label|features                                                                                          |weight|
+---+-----+--------------------------------------------------------------------------------------------------+------+
|1  |1.0  |[WrappedArray([animal_indexed,2.0,animal_indexed]), WrappedArray([talk_indexed,3.0,talk_indexed])]|1     |
|2  |0.0  |[WrappedArray([animal_indexed,1.0,animal_indexed]), WrappedArray([talk_indexed,2.0,talk_indexed])]|1     |
|3  |1.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,1.0,talk_indexed])]|1     |
|4  |2.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,0.0,talk_indexed])]|1     |
+---+-----+--------------------------------------------------------------------------------------------------+------+

The schema is:

root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- value: double (nullable = false)
 |    |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

But I want to convert features from Array[Array] to Array; that is, flatten the inner arrays into a single array in the same column, to get a schema like:
root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: double (nullable = false)
 |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

Thanks.

1 Answer:

Answer 0 (score: 1)

You should read the data as a Dataset with a schema:

case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)
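
As a quick sanity check (my addition, not part of the original answer), you can print the schema Spark derives from these case classes and compare it against the DataFrame's schema; Encoders.product is standard Spark API:

import org.apache.spark.sql.Encoders

// Print the schema Spark infers for MyClass; it should mirror df.printSchema()
Encoders.product[MyClass].schema.printTreeString()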

Then use a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Struct elements reach a UDF as Row, not as the case class, so rebuild
// Something by field name before flattening the nested sequence.
val flatUDF = udf((list: Seq[Seq[Row]]) =>
  list.flatten.map(r => Something(r.getAs[String]("name"), r.getAs[Double]("value"), r.getAs[String]("term"))))

val flattenedDF = myDataFrame.withColumn("features", flatUDF($"features"))
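
As a side note for later readers: Spark 2.4+ (released after this question was asked) ships a built-in flatten function that does the same thing without a custom UDF:

import org.apache.spark.sql.functions.flatten

// flatten collapses array<array<T>> into array<T> natively (Spark 2.4+)
myDataFrame.withColumn("features", flatten($"features"))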

An example of reading the Dataset:

import spark.implicits._  // needed for .as[MyClass] and the $"..." column syntax

val myDataFrame = spark.read.json(path).as[MyClass]
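
One caveat (my addition, not from the original answer): spark.read.json infers whole numbers as long, so casting straight to MyClass, whose uid and weight are Int, can fail the encoder's up-cast check. A sketch that avoids this by supplying the schema explicitly, with path as a placeholder for your input location:

import org.apache.spark.sql.Encoders

// Read with exactly the schema derived from MyClass, so .as[MyClass]
// never has to down-cast inferred long/double columns.
val myDataset = spark.read
  .schema(Encoders.product[MyClass].schema)
  .json(path)
  .as[MyClass]

// Verify: features should now print as array<struct<...>> rather than
// array<array<struct<...>>> after applying the UDF.
myDataset.withColumn("features", flatUDF($"features")).printSchema()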

Hope this helps.