How to convert a String column of a DataFrame to a Struct in Spark

Asked: 2019-12-17 08:11:30

Tags: scala dataframe apache-spark apache-spark-sql

I am currently using Structured Streaming to ingest messages from Kafka.

The raw format of these messages has the following schema:

root
 |-- incidentMessage: struct (nullable = true)
 |    |-- AssignedUnitEvent: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- CallNumber: string (nullable = true)
 |    |    |    |-- Code: string (nullable = true)
 |    |    |    |-- EventDateTime: string (nullable = true)
 |    |    |    |-- EventDispatcherID: string (nullable = true)
 |    |    |    |-- ID: string (nullable = true)
 |    |    |    |-- Notes: string (nullable = true)
 |    |    |    |-- PhoneNumberCalled: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- SubCallNumber: string (nullable = true)
 |    |    |    |-- SupItemNumber: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- UnitID: string (nullable = true)
 |-- preamble: struct (nullable = true)
 |    |-- gateway: string (nullable = true)
 |    |-- product: string (nullable = true)
 |    |-- psap: string (nullable = true)
 |    |-- refDataVersion: long (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |    |-- uuid: string (nullable = true)
 |    |-- vendor: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- raw: string (nullable = true)

However, I kept getting an error when defining the message schema (in the streaming component), so I wrote code that reads all of the root columns as String.

Here is the code I wrote:

//Define the schema
import org.apache.spark.sql.types.{DataTypes, StructType}

val schema1 = new StructType()
  .add("preamble", DataTypes.StringType)
  .add("incidentMessage", DataTypes.StringType)
  .add("raw", DataTypes.StringType)

//Apply the schema to the message (payload)
import org.apache.spark.sql.functions.from_json

val finalResult = Df.withColumn("FinalFrame", from_json($"payload", schema1))
  .select($"FinalFrame.*")
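For reference, here is a sketch of how the nested schema could be declared instead, so that from_json parses the structs directly rather than reading them as strings. The field subset is taken from the printSchema output above (an assumption about the full message layout; the remaining incidentMessage fields would be added the same way):

```scala
import org.apache.spark.sql.types._

// Schema for one element of the AssignedUnitEvent array
// (only a few of the fields from the printSchema output are shown)
val eventSchema = new StructType()
  .add("CallNumber", StringType)
  .add("Code", StringType)
  .add("EventDateTime", StringType)
  .add("UnitID", StringType)

// Nested schema mirroring the original message layout
val fullSchema = new StructType()
  .add("incidentMessage", new StructType()
    .add("AssignedUnitEvent", ArrayType(eventSchema)))
  .add("preamble", new StructType()
    .add("gateway", StringType)
    .add("timestamp", StringType)
    .add("uuid", StringType))
  .add("raw", StringType)
```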

My DataFrame now looks like this:

scala> finalResult.printSchema
root
 |-- incidentMessage: string (nullable = true)
 |-- preamble: string (nullable = true)
 |-- raw: string (nullable = true)

I now have a large number of messages stored with this incorrect schema. I tried applying the correct schema to the messages I have, but the set of messages written to the file system has a variable schema (the incident messages differ), so that approach does not work (I messed up, I should have used Avro).

Is there a way to recover this data and save it in the correct format?

1 Answer:

Answer 0 (score: 0)

Although creating a struct with only one field does not make much sense, you can do it with the struct function:

import org.apache.spark.sql.functions.struct

df.withColumn("incidentMessage",struct($"incidentMessage"))
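Note that struct only wraps the existing string column; it does not parse it. If the stringified columns still contain valid JSON, the data can likely be recovered by re-parsing them with from_json against the proper nested schema (for messages whose layout varies, schema_of_json or one schema per message type may be needed). A minimal, self-contained sketch, with made-up sample data standing in for the mis-parsed frame:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("recover").getOrCreate()
import spark.implicits._

// One sample row standing in for the mis-parsed data:
// the preamble struct was stored as a JSON string
val df = Seq("""{"gateway":"gw1","product":"prodA"}""").toDF("preamble")

// Hypothetical schema for the preamble block; adjust to the real message
val preambleSchema = new StructType()
  .add("gateway", StringType)
  .add("product", StringType)

// from_json turns the string column back into a struct
val recovered = df.withColumn("preamble", from_json($"preamble", preambleSchema))
```

After this, recovered has preamble as a struct again, and the frame can be rewritten to the file system in the intended format (or as Avro).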