I have a DataFrame that looks like this:
+---+-----+--------------------------------------------------------------------------------------------------+------+
|uid|label|features |weight|
+---+-----+--------------------------------------------------------------------------------------------------+------+
|1 |1.0 |[WrappedArray([animal_indexed,2.0,animal_indexed]), WrappedArray([talk_indexed,3.0,talk_indexed])]|1 |
|2 |0.0 |[WrappedArray([animal_indexed,1.0,animal_indexed]), WrappedArray([talk_indexed,2.0,talk_indexed])]|1 |
|3 |1.0 |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,1.0,talk_indexed])]|1 |
|4 |2.0 |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,0.0,talk_indexed])]|1 |
+---+-----+--------------------------------------------------------------------------------------------------+------+
The schema is:
root
|-- uid: integer (nullable = false)
|-- label: double (nullable = false)
|-- features: array (nullable = false)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- name: string (nullable = true)
| | | |-- value: double (nullable = false)
| | | |-- term: string (nullable = true)
|-- weight: integer (nullable = false)
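However, I want to convert features from Array[Array[struct]] to Array[struct]; that is, flatten the nested arrays within the same column to get a schema like:
root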
|-- uid: integer (nullable = false)
|-- label: double (nullable = false)
|-- features: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: double (nullable = false)
| | |-- term: string (nullable = true)
|-- weight: integer (nullable = false)
Thanks.
Answer 0: (score: 1)
You should read the data as a typed Dataset, with case classes that describe the schema:
case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)
Then use a UDF like this:
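import org.apache.spark.sql.functions.udf
import spark.implicits._ // needed for the $"..." column syntax
val flatUDF = udf((list: Seq[Seq[Something]]) => list.flatten)
val flattedDF = myDataFrame.withColumn("flatten", flatUDF($"features"))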
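As an alternative to the UDF, here is a minimal sketch that assumes Spark 2.4 or later, where the built-in flatten function in org.apache.spark.sql.functions collapses an array of arrays by one level. The object name FlattenFeatures, the helper flattenFeatures, and the sample row are only illustrative. Writing the result back to "features" keeps it in the same column, as the question asks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, flatten}

case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)

object FlattenFeatures {
  // Overwrites the nested `features` column with its one-level flattening.
  // Assumption: Spark 2.4+, where functions.flatten is available.
  def flattenFeatures(df: DataFrame): DataFrame =
    df.withColumn("features", flatten(col("features")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // One illustrative row shaped like the data in the question.
    val df = Seq(
      MyClass(1, 1.0,
        Seq(
          Seq(Something("animal_indexed", 2.0, "animal_indexed")),
          Seq(Something("talk_indexed", 3.0, "talk_indexed"))),
        1)
    ).toDF()

    // `features` should now be an array of structs rather than an array of arrays.
    flattenFeatures(df).printSchema()
    spark.stop()
  }
}
```

Note that the built-in flatten returns null for a row if any of its inner arrays is null, so data with missing inner arrays may need to be handled first.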
Example of reading the Dataset:
val myDataFrame = spark.read.json(path).as[MyClass]
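Once the data is a typed Dataset, the flattening can also be done with a plain typed map instead of a UDF. The sketch below assumes the features field is named features in MyClass so that it matches the column; FlatClass, TypedFlatten and flattenTyped are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{Dataset, Encoders}

case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)
// Hypothetical output type: same fields, with `features` flattened one level.
case class FlatClass(uid: Int, label: Double, features: Seq[Something], weight: Int)

object TypedFlatten {
  // Maps each typed row to a copy whose nested `features` are flattened.
  // The encoder is passed explicitly so no implicits import is required here.
  def flattenTyped(ds: Dataset[MyClass]): Dataset[FlatClass] =
    ds.map((r: MyClass) => FlatClass(r.uid, r.label, r.features.flatten, r.weight))(
      Encoders.product[FlatClass])
}
```

With the Dataset read above, this would be called as TypedFlatten.flattenTyped(myDataFrame).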
Hope this helps.