Spark: pruning nested columns/fields

Asked: 2017-09-10 17:25:15

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-dataset

I have a question about the possibility of pruning nested fields.

I am developing a Spark data source for a high-energy physics data format (ROOT).

Below is the schema of a file read with the DataSource I am developing:

 root
 |-- EventAuxiliary: struct (nullable = true)
 |    |-- processHistoryID_: struct (nullable = true)
 |    |    |-- hash_: string (nullable = true)
 |    |-- id_: struct (nullable = true)
 |    |    |-- run_: integer (nullable = true)
 |    |    |-- luminosityBlock_: integer (nullable = true)
 |    |    |-- event_: long (nullable = true)
 |    |-- processGUID_: string (nullable = true)
 |    |-- time_: struct (nullable = true)
 |    |    |-- timeLow_: integer (nullable = true)
 |    |    |-- timeHigh_: integer (nullable = true)
 |    |-- luminosityBlock_: integer (nullable = true)
 |    |-- isRealData_: boolean (nullable = true)
 |    |-- experimentType_: integer (nullable = true)
 |    |-- bunchCrossing_: integer (nullable = true)
 |    |-- orbitNumber_: integer (nullable = true)
 |    |-- storeNumber_: integer (nullable = true)

The DataSource is here: https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62

When the reader is built via FileFormat's buildReaderWithPartitionValues method:

override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,      // full schema of the data in the files
    partitionSchema: StructType, // schema of the partition columns
    requiredSchema: StructType,  // columns the query actually needs
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

I see that requiredSchema always contains all of the fields/members of the top-level column being accessed. That means that when I want to select one specific nested field, e.g. df.select("EventAuxiliary.id_.run_"), requiredSchema is again the full struct of that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:

root
|-- EventAuxiliary: struct...
|  |-- id_: struct ...
|  |    |-- run_: integer

since that is the only schema the select statement needs.
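
To make the observation reproducible outside my data source, here is a small standalone example against the built-in Parquet source (a rough sketch; the case classes, output path, and object name are made up for illustration). The ReadSchema printed in the physical plan shows which columns the file scan is asked to produce; here too the full EventAuxiliary struct shows up, not just run_:

import org.apache.spark.sql.SparkSession

// Stand-ins for the nested EventAuxiliary column; names are illustrative only.
case class Id(run_ : Int, luminosityBlock_ : Int, event_ : Long)
case class Aux(id_ : Id, isRealData_ : Boolean)
case class Event(EventAuxiliary: Aux)

object NestedSelectDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a tiny nested dataset, then select a single leaf field.
    Seq(Event(Aux(Id(1, 2, 3L), isRealData_ = true)))
      .toDF()
      .write.mode("overwrite").parquet("/tmp/events")

    // The physical plan's ReadSchema reveals what the file scan must produce.
    spark.read.parquet("/tmp/events")
      .select("EventAuxiliary.id_.run_")
      .explain()

    spark.stop()
  }
}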

Basically, I want to know how to prune nested fields at the data source level. I thought requiredSchema would contain only the fields coming from df.select.
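
To make the goal concrete, here is roughly the transformation I would like applied to requiredSchema before it reaches the reader: a recursive intersection of the struct with the requested dotted paths (a minimal sketch; SchemaPruning and prune are made-up names, not a Spark API):

import org.apache.spark.sql.types._

object SchemaPruning {
  // Prune `schema` down to the dotted paths in `paths`,
  // e.g. prune(schema, Seq("EventAuxiliary.id_.run_")).
  def prune(schema: StructType, paths: Seq[String]): StructType =
    pruneStruct(schema, paths.map(_.split('.').toList))

  private def pruneStruct(schema: StructType, paths: Seq[List[String]]): StructType =
    StructType(schema.fields.flatMap { field =>
      // Requested paths that start at this field (Spark resolves
      // column names case-insensitively by default).
      val matching = paths.filter(_.headOption.exists(_.equalsIgnoreCase(field.name)))
      if (matching.isEmpty) None
      else {
        val rest = matching.map(_.tail).filter(_.nonEmpty)
        field.dataType match {
          // Recurse into nested structs unless the whole struct was requested.
          case st: StructType if rest.nonEmpty =>
            Some(field.copy(dataType = pruneStruct(st, rest)))
          case _ => Some(field)
        }
      }
    })
}

With the schema above, prune(dataSchema, Seq("EventAuxiliary.id_.run_")) would yield exactly the three-level struct I showed.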

I tried to see what avro/parquet are doing and found this: https://github.com/apache/spark/pull/14957/files

Any suggestions/comments would be much appreciated!

Thanks!

VK

0 Answers:

No answers yet.