I am working with a DataFrame df whose schema looks like this:
root
|-- array(data1, data2, data3, data4): array (nullable = false)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- k: struct (nullable = false)
| | | | |-- v: string (nullable = true)
| | | | |-- t: string (nullable = false)
| | | |-- resourcename: string (nullable = true)
| | | |-- criticity: string (nullable = true)
| | | |-- v: string (nullable = true)
| | | |-- vn: double (nullable = true)
As df.show() illustrates, the column of type array contains the four arrays "data1", "data2", "data3" and "data4", which all share the same schema and element types; it was built with:
df.withColumn("Column1", array(col("data1"), col("data2"),
  col("data3"), col("data4")))
I want to get a new DataFrame containing all the elements of "data1", "data2", "data3" and "data4" in a single array. The new schema must be:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- criticity: string (nullable = true)
| | |-- k: struct (nullable = true)
| | | |-- t: string (nullable = true)
| | | |-- v: string (nullable = true)
| | |-- resourcename: string (nullable = true)
| | |-- v: string (nullable = true)
| | |-- vn: double (nullable = true)
Answer 0 (score: 0)
I suggest using Datasets. You should first define three case classes:
case class MyClass1(t: String, v: String)
case class MyClass2(criticity: String, k: MyClass1, resourcename: String, v: String, vn: Double) // field name "k" must match the struct column name for the encoder to resolve it
case class MyList(data:Seq[Seq[MyClass2]])
Then create your Dataset like this:
val myDS = df.select(array($"data1",$"data2",$"data3",$"data4").as("data")).as[MyList]
// note that myDS.data has the type: list of lists of MyClass2
// Datasets allow us to make this kind of stuff (flatten data)
val myDSFlatten = myDS.flatMap(_.data)
"myDSFlatten" should have the desired schema.
Note that I'm using Scala.
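Outside of Spark, the flattening step is ordinary Scala: flatMap over the outer sequence emits each inner sequence as its own record. A minimal sketch of that semantics with simplified, illustrative case classes (these names are stand-ins, not the ones from the schema above):

```scala
// Simplified stand-ins for the answer's case classes (names are illustrative):
case class Item(v: String, vn: Double)
case class Wrapper(data: Seq[Seq[Item]])

// One "row" whose data field holds two inner arrays, as data1/data2 would.
val rows = Seq(Wrapper(Seq(Seq(Item("a", 1.0)), Seq(Item("b", 2.0), Item("c", 3.0)))))

// Dataset.flatMap(_.data) does in a distributed way what this does in memory:
// each row contributes its inner sequences as separate output records.
val flattened = rows.flatMap(_.data)
println(flattened.length) // 2 inner arrays survive as records
```

The same flatMap on a Dataset keeps the struct fields intact, which is why the resulting schema matches the desired one.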
Answer 1 (score: 0)
If you're using Spark >= 2.4, you can do this easily with the new flatten function.
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
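In Spark this would look roughly like df.select(flatten(array(col("data1"), col("data2"), col("data3"), col("data4"))).as("data")), assuming the four array columns exist as in the question. The plain-Scala sketch below (no Spark required) shows the semantics flatten applies to each row's nested data:

```scala
// Spark's flatten mirrors Scala's own Seq.flatten: an array of arrays
// collapses into one array, preserving element order.
val nested = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))
val flat = nested.flatten
println(flat.mkString(",")) // a,b,c,d,e
```

Because flatten is a built-in SQL function, this avoids the round-trip through case classes and encoders that the Dataset approach needs.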