Question

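I want to split a DataFrame's schema into a collection. I am trying the code below, but the schema prints as a single string. Is there any way to split it into a collection with one element per StructType, so that I can manipulate it (for example, keep only the array columns from the output)? I am trying to flatten a complex, multi-level struct + array DataFrame.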

import org.apache.spark.sql.functions.explode
import org.apache.spark.sql._

val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3],"d":[2,3]}""")))

test.printSchema

val flattened = test.withColumn("b", explode($"d"))

flattened.printSchema

def identifyArrayColumns(dataFrame : DataFrame) = {
    val output = for ( d <- dataFrame.collect()) yield
    {
       d.schema
    }
    output.toList
}


identifyArrayColumns(test)

Current output

identifyArrayColumns: (dataFrame: org.apache.spark.sql.DataFrame)List[org.apache.spark.sql.types.StructType]
res58: List[org.apache.spark.sql.types.StructType] = List(StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true)))

This is one whole string, so I cannot filter out just the array columns. Suppose I do a foreach(println); I get only one line:

scala> output.foreach(println)
StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))

What I want is each StructType as a separate element of the collection.

Answer 1

You only need to filter the DataFrame schema's fields for those whose type is array. There is no need to inspect the DataFrame's data for this:

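import org.apache.spark.sql.types.{StructField, StructType}

// Keep only the top-level fields whose data type is an array.
def identifyArrayColumns(schema: StructType): List[StructField] = {
  schema.fields.filter(_.dataType.typeName == "array").toList
}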
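As a rough illustration of the same idea, the sketch below builds by hand a schema that mirrors the one printed in the question (with the question's DataFrame you would use `test.schema` instead) and filters its fields. It is written for spark-shell or any Scala REPL with spark-sql on the classpath. The names `schema`, `arrayFields` and `arrayNames` are made up for this example, and checking `isInstanceOf[ArrayType]` is simply an alternative to comparing `typeName` strings.

```scala
import org.apache.spark.sql.types._

// Hand-built schema mirroring the question's output: a is a long, b and d are array<long>.
// With the question's DataFrame, test.schema would take the place of this value.
val schema = StructType(Seq(
  StructField("a", LongType, nullable = true),
  StructField("b", ArrayType(LongType, containsNull = true), nullable = true),
  StructField("d", ArrayType(LongType, containsNull = true), nullable = true)
))

// schema.fields is an Array[StructField], so each field is already a separate element.
// Here the data type is checked directly rather than through its typeName string.
val arrayFields: List[StructField] =
  schema.fields.toList.filter(_.dataType.isInstanceOf[ArrayType])

// Names of the array columns only.
val arrayNames: List[String] = arrayFields.map(_.name)
arrayNames.foreach(println)
```

No SparkContext or DataFrame rows are needed here, which is the point of the answer: the schema alone carries the type information.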

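Note that this is a "shallow" solution: it only returns array fields directly under the root. If you also want to find arrays nested inside arrays / maps / structs, you need to traverse the schema recursively and apply this filter at each level, for example (the version below descends into nested structs only):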

// can be converted into a tail-recursive method by adding another argument to accumulate results
def identifyArrayColumns(schema: StructType): List[StructField] = {
  val arrays = schema.fields.filter(_.dataType.typeName == "array").toList
  val deeperArrays = schema.fields.flatMap {
    case f @ StructField(_, s: StructType, _, _) => identifyArrayColumns(s)
    case _ => List()
  }
  arrays ++ deeperArrays
}
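The recursive version above follows nested structs but not array elements or map values. A minimal sketch of one way to cover those cases is shown below. The helper `arrayPaths`, its `prefix` parameter, the dotted path naming, and the sample schema `nested` are all assumptions made for this illustration; they are not part of the original answer or of Spark's API.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: collects a dotted path for every array-typed field,
// descending into structs, array element types, and map value types.
def arrayPaths(dataType: DataType, prefix: String = ""): List[String] = dataType match {
  case s: StructType =>
    s.fields.toList.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      // Report the field itself if it is an array, then look inside its type.
      val self = if (f.dataType.isInstanceOf[ArrayType]) List(path) else Nil
      self ++ arrayPaths(f.dataType, path)
    }
  case ArrayType(elementType, _) => arrayPaths(elementType, prefix)
  case MapType(_, valueType, _)  => arrayPaths(valueType, prefix)
  case _                         => Nil
}

// Made-up schema with arrays at the root, inside a struct, and inside an array of structs.
val nested = StructType(Seq(
  StructField("a", LongType),
  StructField("b", ArrayType(LongType)),
  StructField("c", StructType(Seq(
    StructField("d", ArrayType(StringType)),
    StructField("e", StringType)
  ))),
  StructField("f", ArrayType(StructType(Seq(
    StructField("g", ArrayType(LongType))
  ))))
))

val paths = arrayPaths(nested)
paths.foreach(println)
```

Returning paths rather than bare StructField values is a design choice for this sketch: when flattening, the path tells you where the array sits, whereas a nested field's name alone can be ambiguous.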
