Question

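I want to get all the columns of a DataFrame. If the DataFrame has a flat structure (no nested StructTypes), df.columns produces the correct result. I would also like the names of all nested columns to be returned, for example:

Given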

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("name", StringType) ::
  StructField("nameSecond", StringType) ::
  StructField("nameDouble", StringType) ::
  StructField("someStruct", StructType(
    StructField("insideS", StringType) ::
    StructField("insideD", DoubleType) ::
    Nil
  )) ::
  Nil
)
// `spark` is an existing SparkSession (e.g. the one provided by spark-shell)
val rdd = spark.sparkContext.emptyRDD[Row]
val df = spark.createDataFrame(rdd, schema)

I would like to get

Seq("name", "nameSecond", "nameDouble", "someStruct", "insideS", "insideD")

Answer 1

You can traverse the schema with this recursive function:

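import org.apache.spark.sql.types.{StructField, StructType}

// Collects every field name, descending into nested StructTypes.
def flattenSchema(schema: StructType): Seq[String] = {
  schema.fields.flatMap {
    // A struct column: emit its own name, then the names of its children
    case StructField(name, inner: StructType, _, _) => Seq(name) ++ flattenSchema(inner)
    // Any other column: emit just its name
    case StructField(name, _, _, _) => Seq(name)
  }
}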

println(flattenSchema(schema)) 
// prints: ArraySeq(name, nameSecond, nameDouble, someStruct, insideS, insideD)
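
If you need to select the nested columns afterwards, bare names such as insideS may not be enough, since Spark addresses nested fields by a dotted path (someStruct.insideS). Below is a minimal sketch of a variant that returns those paths. The helper name flattenSchemaQualified and its prefix parameter are my own and are not part of the function above. It assumes only StructType nesting (it does not descend into arrays or maps) and field names that contain no dots themselves.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (an illustration, not the function from the answer above):
// returns dotted paths such as "someStruct.insideS" instead of bare field names.
def flattenSchemaQualified(schema: StructType, prefix: Option[String] = None): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    // Prepend the parent path, if there is one
    val path = prefix.fold(field.name)(p => s"$p.${field.name}")
    field.dataType match {
      // Struct column: keep its own path, then recurse into its children
      case inner: StructType => path +: flattenSchemaQualified(inner, Some(path))
      // Leaf column (arrays and maps are treated as leaves in this sketch)
      case _ => Seq(path)
    }
  }

// A cut-down version of the schema from the question
val nested = StructType(Seq(
  StructField("name", StringType),
  StructField("someStruct", StructType(Seq(
    StructField("insideS", StringType),
    StructField("insideD", DoubleType)
  )))
))

val paths = flattenSchemaQualified(nested)
// paths == Seq("name", "someStruct", "someStruct.insideS", "someStruct.insideD")
```

The resulting strings can then be turned into columns, for example with paths.map(org.apache.spark.sql.functions.col), and passed to df.select.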

How to recursively get all columns in a Spark DataFrame
