Suppose I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
 |    |    |-- useless: string (nullable = true)
I'm looking for a way to select only a subset of the fields of the array column subClasss, namely id and size, while keeping the nested array structure. The resulting schema would be:

root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)

I tried doing:

df.select("subClasss.id", "subClasss.size")

but this splits the array subClasss into two separate arrays:

root
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Is there a way to keep the original structure and just eliminate the useless field? Thank you for your time.
Answer 0 (score: 4)
Spark >= 2.4:
You can use arrays_zip together with a cast:

import org.apache.spark.sql.functions.arrays_zip

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))
The cast is required to rename the nested fields; without it, Spark uses the automatically generated names 0, 1, ..., n.
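The renaming issue can be illustrated with a plain-Scala analogy (no Spark needed; IdSize is a hypothetical class for this sketch): zipping two parallel sequences yields tuples whose components only have positional names, much like arrays_zip produces struct fields named 0 and 1, and mapping onto a named case class plays the role the cast plays above.

```scala
// Hypothetical analogy to arrays_zip + cast, using plain collections.
case class IdSize(id: String, size: Int)

val ids   = Seq("1", "2", "3")
val sizes = Seq(1, 2, 3)

// Like arrays_zip before the cast: only positional names (_1, _2).
val zipped: Seq[(String, Int)] = ids.zip(sizes)

// Like the cast: give the components proper field names.
val named: Seq[IdSize] = zipped.map { case (i, s) => IdSize(i, s) }
```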
Spark < 2.4:
You can use a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Record(id: String, size: Int)

val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
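The per-array logic the UDF applies can be sketched in plain Scala without a Spark session (SubClass and Record mirror the classes above; dropUselessPlain is a hypothetical helper for this sketch):

```scala
// Plain-Scala sketch of the UDF body: project each element of the
// array down to (id, size), dropping the useless field.
case class SubClass(id: String, size: Int, useless: String)
case class Record(id: String, size: Int)

def dropUselessPlain(xs: Seq[SubClass]): Seq[Record] =
  xs.map(s => Record(s.id, s.size))

val in  = Seq(SubClass("1", 1, "thisIsUseless"),
              SubClass("2", 2, "thisIsUseless"))
val out = dropUselessPlain(in)
```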