Question

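The project I'm working on has to process a large number of AVRO files every day. To extract data from AVRO I use Spark SQL. To do that, I first have to call printSchema and then pick the fields I want to look at. I'd like to automate this process: given any input AVRO, I want a script that automatically generates the Spark SQL query (taking into account the structs and arrays in the avsc file). I can write the script in Java or Python.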

- Sample input AVRO

root
|-- identifier: struct (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- id: string (nullable = true)
|    |-- version: long (nullable = true)
|-- alternativeIdentifiers: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- identifier: struct (nullable = true)
|    |    |    |-- domain: string (nullable = true)
|    |    |    |-- id: string (nullable = true)

- Expected output

SELECT identifier.domain, identifier.id, identifier.version

Answer 1

You can use something like this to generate the list of columns from the schema:

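  import org.apache.spark.sql.types.{StructField, StructType}
  // Returns the dotted paths of all leaf fields under f.
  // Only structs are descended into; an array field is returned as a single leaf.
  def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(s) =>
        // Nested struct: recurse with the current field name as the new prefix.
        s.flatMap(x => getStructFieldName(x, bname + f.name))
      case _ => Seq(bname + f.name)
    }
  }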
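
The question also asks about arrays, which the function above treats as leaves. A minimal sketch of one way to cover them, assuming the arrays contain structs and are not nested inside other arrays: Spark lets you extract a field from an array of structs with the same dot notation, returning an array of that field's values. The name flattenFieldNames is hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (an assumption, not from the original answer):
// collects dotted leaf paths, descending into structs and into arrays of structs.
def flattenFieldNames(dt: DataType, prefix: String = ""): Seq[String] = dt match {
  case StructType(fields) =>
    fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      flattenFieldNames(f.dataType, path)
    }
  // An array of structs keeps the array's own path; its element fields are appended to it.
  case ArrayType(elementType: StructType, _) =>
    flattenFieldNames(elementType, prefix)
  // Anything else (primitives, maps, arrays of non-structs) is treated as a leaf.
  case _ => Seq(prefix)
}
```

Called as flattenFieldNames(data.schema) on the schema from the question, this would also return alternativeIdentifiers.identifier.domain and alternativeIdentifiers.identifier.id. Field names that themselves contain dots or other special characters would need backtick quoting, which this sketch does not handle.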

getStructFieldName can then be applied to an actual DataFrame as follows:

val data = spark.read.json("some_data.json")
val cols = data.schema.flatMap(x => getStructFieldName(x))
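
Since the question mentions an avsc file, the schema could probably also be obtained without reading any data. The sketch below assumes the spark-avro module (Spark 2.4 or later) is on the classpath and uses its SchemaConverters.toSqlType; the inline avsc content is a hypothetical, trimmed-down version of the question's example.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Hypothetical .avsc content (an assumption); in practice it would be read from a file.
val avsc =
  """{"type":"record","name":"Root","fields":[
    |  {"name":"identifier","type":{"type":"record","name":"Identifier","fields":[
    |    {"name":"domain","type":"string"},
    |    {"name":"id","type":"string"},
    |    {"name":"version","type":"long"}]}}
    |]}""".stripMargin

val avroSchema = new Schema.Parser().parse(avsc)

// toSqlType returns a SchemaType(dataType, nullable); an Avro record maps to a StructType.
val sparkSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
```

The resulting sparkSchema can be passed to flatMap in the same way as data.schema.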

cols is now a sequence of strings, which we can pass to select:

import org.apache.spark.sql.functions.col
data.select(cols.map(col): _*)

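Alternatively, we can build a comma-separated list to use in spark.sql (this assumes the DataFrame has been registered as a view named table):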

spark.sql(s"select ${cols.mkString(", ")} from table")
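
To get the query text itself, as the question asks, the string can be built separately from running it. A minimal sketch; the helper name buildSelect and the view name avro_data are assumptions made for illustration:

```scala
// Hypothetical helper: assembles the SELECT statement as a plain string.
def buildSelect(cols: Seq[String], view: String): String =
  s"SELECT ${cols.mkString(", ")} FROM $view"

val query = buildSelect(
  Seq("identifier.domain", "identifier.id", "identifier.version"),
  "avro_data")
// query: SELECT identifier.domain, identifier.id, identifier.version FROM avro_data

// To execute it, the DataFrame would first be registered under that view name:
//   data.createOrReplaceTempView("avro_data")
//   spark.sql(query)
```

Keeping the string construction separate makes it easy to print or log the generated query before running it.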
