How to recursively get all columns of a Spark DataFrame

Asked: 2018-03-13 15:59:13

Tags: scala apache-spark dataframe apache-spark-sql

I want to get all the column names of a DataFrame. If the DataFrame has a flat structure (no nested StructTypes), df.columns produces the correct result, but I would also like the names of all nested columns to be returned, e.g.:

Given

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("name", StringType) ::
  StructField("nameSecond", StringType) ::
  StructField("nameDouble", StringType) ::
  StructField("someStruct", StructType(
    StructField("insideS", StringType)::
    StructField("insideD", DoubleType)::
    Nil
  )) ::
  Nil
)
// assumes an active SparkSession named spark, as in spark-shell
val rdd = spark.sparkContext.emptyRDD[Row]
val df = spark.createDataFrame(rdd, schema)

I would like to get

Seq("name", "nameSecond", "nameDouble", "someStruct", "insideS", "insideD")

1 Answer:

Answer 0 (score: 4)

You can use this recursive function to traverse the schema:

def flattenSchema(schema: StructType): Seq[String] = {
  schema.fields.flatMap {
    // a nested struct: emit its own name, then recurse into its fields
    case StructField(name, inner: StructType, _, _) => Seq(name) ++ flattenSchema(inner)
    // a leaf field: emit just the name
    case StructField(name, _, _, _) => Seq(name)
  }
}

println(flattenSchema(schema)) 
// prints: ArraySeq(name, nameSecond, nameDouble, someStruct, insideS, insideD)
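
If you also need the fully qualified paths of the nested fields (e.g. to pass to df.select), a small variant of the same recursion can carry the parent prefix along. This is a sketch building on the answer above, not part of the original answer, and the name flattenSchemaQualified is made up for illustration:

def flattenSchemaQualified(schema: StructType, prefix: String = ""): Seq[String] = {
  schema.fields.flatMap {
    // emit the struct's own path, then recurse with "parent." prepended
    case StructField(name, inner: StructType, _, _) =>
      Seq(prefix + name) ++ flattenSchemaQualified(inner, prefix + name + ".")
    // a leaf field: emit its qualified path
    case StructField(name, _, _, _) => Seq(prefix + name)
  }
}

println(flattenSchemaQualified(schema).mkString(", "))
// name, nameSecond, nameDouble, someStruct, someStruct.insideS, someStruct.insideD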