Rename key in a nested Spark DataFrame Schema (Scala)

Asked: 2019-01-18 18:16:26

Tags: scala apache-spark schema parquet

I have a use case where I need to read a nested JSON schema and write it back as Parquet. My schema changes based on the day I am reading the data, so I don't know the exact schema in advance. Since some of my nested keys contain characters like spaces, when I try to save the data as Parquet I get an exception complaining about the special characters ,;{}()\\n\\t=

This is a sample schema; it is not the real one. The keys are dynamic and change day by day.

val nestedSchema = StructType(Seq(
  StructField("event_time", StringType),
  StructField("event_id", StringType),
  StructField("app", StructType(Seq(
    StructField("environment", StringType),
    StructField("name", StringType),
    StructField("type", StructType(Seq(
      StructField("word tier", StringType), // this causes the problem when saving as Parquet
      StructField("level", StringType)
    )))
  )))
))

val myDF = spark.createDataFrame(sc.emptyRDD[Row], nestedSchema)

myDF.printSchema

Output

root
 |-- event_time: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- app: struct (nullable = true)
 |    |-- environment: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- type: struct (nullable = true)
 |    |    |-- word tier: string (nullable = true)
 |    |    |-- level: string (nullable = true)

Trying to save as parquet

myDF.write
          .mode("overwrite")
          .option("compression", "snappy")
          .parquet("PATH/TO/DESTINATION")

I found a solution like this:

myDF.toDF(myDF.schema.fieldNames.map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
  .write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("PATH/TO/DESTINATION")

But it only works on top-level keys, not on nested ones. Is there a recursive solution for this?

My question is not a duplicate of this question, since my schema is dynamic and I don't know what my keys are. They change based on the data I am reading, so my solution needs to be generic: I need to somehow recursively recreate the same schema structure but with corrected key names.

1 Answer:

Answer 0 (score: 0)

Basically, you have to construct a Column expression that converts your input into a type with sanitized field names. For this you can use the org.apache.spark.sql.functions.struct function, which lets you combine other Columns to build a struct-type column. Something like this should work:

import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType

// Replace problematic characters in a single field name.
def sanitizeName(s: String): String = s.replace(" ", "_")

// Rebuild a struct column, renaming every (possibly nested) field.
// `context` resolves a field name to the Column holding its value.
def sanitizeFieldNames(st: StructType, context: String => Column): Column = f.struct(
  st.fields.map { sf =>
    val sanitizedName = sanitizeName(sf.name)
    val sanitizedField = sf.dataType match {
      case struct: StructType =>
        // descend into the nested struct, scoping field lookups to this field
        val subcontext = context(sf.name)
        sanitizeFieldNames(struct, subcontext(_))
      case _ => context(sf.name)
    }
    sanitizedField.as(sanitizedName)
  }: _*
)

You can use it like this:

val df: DataFrame = ...

val appFieldType = df.schema("app").asInstanceOf[StructType]  // or otherwise obtain the field type
df.withColumn(
  "app",
  sanitizeFieldNames(appFieldType, df("app")(_))
)

For your schema, this recursive function would return a column like
f.struct(
  df("app")("environment").as("environment"),
  df("app")("name").as("name"),
  f.struct(
    df("app")("type")("word tier").as("word_tier"),
    df("app")("type")("level").as("level")
  ).as("type")
)

You then assign that to the "app" column, replacing what was there before.
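Since the keys in the question are dynamic, you would not hard-code "app". A minimal sketch of a generic wrapper that applies the same idea to every top-level field, assuming the helpers defined above (the sanitizeDataFrame name is just an illustration):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.StructType

// Sketch: rename every top-level column and recurse into struct-typed ones.
def sanitizeDataFrame(df: DataFrame): DataFrame = {
  val cols: Seq[Column] = df.schema.fields.map { sf =>
    val col = sf.dataType match {
      case st: StructType => sanitizeFieldNames(st, df(sf.name)(_))
      case _              => df(sf.name)
    }
    col.as(sanitizeName(sf.name))
  }
  df.select(cols: _*)
}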

However, this solution has a limitation: it does not support nested arrays or maps. If your schema contains structs inside arrays or maps, this method will not transform those inner structs. That said, Spark 2.4 added functions that operate on collections, so in Spark 2.4 this function could potentially be generalized to also support nested arrays and maps.
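For illustration only, an untested sketch of what that could look like with Spark 2.4's SQL transform function, assuming a hypothetical array column items whose elements are structs containing a "word tier" field:

import org.apache.spark.sql.functions.expr

// Hypothetical: `items` is an array<struct<`word tier`: string, level: string>>.
// The higher-order SQL function `transform` rebuilds each element with clean names.
val fixed = df.withColumn(
  "items",
  expr("transform(items, x -> named_struct('word_tier', x['word tier'], 'level', x.level))")
)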

Finally, you can do what you want with mapPartitions. First, write a recursive method that sanitizes only the StructType of your fields:

import org.apache.spark.sql.types._

def sanitizeType(dt: DataType): DataType = dt match {
  case st: StructType =>  // rename fields and recurse into each field's type
    StructType(st.fields.map(f => f.copy(name = sanitizeName(f.name), dataType = sanitizeType(f.dataType))))
  case at: ArrayType => at.copy(elementType = sanitizeType(at.elementType))  // recurse into elements
  case mt: MapType => mt.copy(keyType = sanitizeType(mt.keyType), valueType = sanitizeType(mt.valueType))
  case _ => dt  // simple types have nothing to sanitize
}

Second, apply the sanitized schema to the dataframe. There are basically two ways to do this: a safe one using mapPartitions, and one that relies on internal Spark APIs.

With mapPartitions it is straightforward:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
df.mapPartitions(identity)(RowEncoder(sanitizeType(df.schema).asInstanceOf[StructType]))

Here we apply a mapPartitions operation and explicitly specify the output encoder. Keep in mind that in Spark a schema is not intrinsic to the data: it is always associated with a particular dataframe. All data inside a dataframe is represented as rows, with no labels on the individual fields, only positions. As long as your schema has exactly the same types at the same positions (the names may differ), it will work as expected.

mapPartitions does add a few extra nodes to the logical plan. To avoid that, you can construct a Dataset[Row] instance directly with a specific encoder:

new Dataset[Row](df.sparkSession, df.queryExecution.logical, RowEncoder(sanitizeType(df.schema).asInstanceOf[StructType]))

This avoids the unnecessary mapPartitions (which typically results in a deserialize-map-serialize step in the query execution plan), but it may not be safe; I personally have not checked it, but it might work for you.
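Put together for the original use case, a hedged end-to-end sketch might look like this (paths are placeholders):

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Read the dynamic JSON, sanitize the schema, then write Parquet.
val raw = spark.read.json("PATH/TO/SOURCE")
val cleaned = raw.mapPartitions(identity)(
  RowEncoder(sanitizeType(raw.schema).asInstanceOf[StructType]))
cleaned.write.mode("overwrite").option("compression", "snappy").parquet("PATH/TO/DESTINATION")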