I am currently using Structured Streaming to ingest messages from Kafka (a sketch of the ingestion side follows the schema tree below).
The raw format of these messages has the following schema structure:
root
|-- incidentMessage: struct (nullable = true)
| |-- AssignedUnitEvent: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- CallNumber: string (nullable = true)
| | | |-- Code: string (nullable = true)
| | | |-- EventDateTime: string (nullable = true)
| | | |-- EventDispatcherID: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Notes: string (nullable = true)
| | | |-- PhoneNumberCalled: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- SubCallNumber: string (nullable = true)
| | | |-- SupItemNumber: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- UnitID: string (nullable = true)
|-- preamble: struct (nullable = true)
| |-- gateway: string (nullable = true)
| |-- product: string (nullable = true)
| |-- psap: string (nullable = true)
| |-- refDataVersion: long (nullable = true)
| |-- source: string (nullable = true)
| |-- timestamp: string (nullable = true)
| |-- uuid: string (nullable = true)
| |-- vendor: string (nullable = true)
| |-- version: string (nullable = true)
|-- raw: string (nullable = true)
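For context, the ingestion side presumably looks roughly like the sketch below; the bootstrap server and topic name are placeholders, but this is where the Df and its payload column used later would come from:
// A minimal sketch of the Kafka source (server and topic are placeholders)
val Df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "incident-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")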
But I made a mistake when defining the schema for the message (in the streaming job): I wrote code that casts all of the root columns to String.
Here is the code I wrote:
//Define the schema (the mistake: every root column is declared as String)
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DataTypes, StructType}
val schema1 = new StructType().add("preamble", DataTypes.StringType).add("incidentMessage", DataTypes.StringType).add("raw", DataTypes.StringType)
//Apply the schema to the message (payload)
val finalResult = Df.withColumn("FinalFrame", from_json($"payload", schema1)).select($"FinalFrame.*")
Now my dataframe looks like this:
scala> finalResult.printSchema
root
|-- incidentMessage: string (nullable = true)
|-- preamble: string (nullable = true)
|-- raw: string (nullable = true)
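Note (an assumption based on the Spark versions I have used, not verified against this exact setup): when from_json is told a field is StringType but the JSON value is actually an object, it keeps the raw JSON text of that object instead of returning null, so the incidentMessage and preamble string columns above should still contain the original nested JSON. A quick illustration of that assumed behavior:
// Illustration only: a struct-valued field declared as StringType
// is expected to come back as its raw JSON text
import spark.implicits._
val sample = Seq("""{"preamble":{"gateway":"g1"},"raw":"x"}""").toDF("payload")
sample.select(from_json($"payload", schema1).getField("preamble")).show(false)
// expected output: the JSON text {"gateway":"g1"}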
I now have a large number of messages that were written with this incorrect schema. I tried applying the correct schema to the messages I already have, but because the set of messages written to the file system has a variable schema (the event messages vary), that approach doesn't work (I messed up; I should have used Avro).
Is there any way to recover this data and save it in the correct format?
Answer 0 (score: 0)
Although creating a struct with only one field doesn't make much sense, you can do it with the struct function:
import org.apache.spark.sql.functions.struct
// Wrap the existing string column in a single-field struct
df.withColumn("incidentMessage", struct($"incidentMessage"))
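If the assumption above holds and the string columns still carry the original JSON text, one possible recovery path (a sketch, not tested against the real data) is to rebuild the nested schema by hand from the printSchema tree in the question and re-parse the string column with from_json. In the default PERMISSIVE mode, malformed rows come back as a null struct and fields that are simply missing come back as null fields, so the problem rows can be isolated afterwards:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Nested schema rebuilt by hand from the printSchema tree in the question
val assignedUnitEvent = new StructType()
  .add("CallNumber", StringType)
  .add("Code", StringType)
  .add("EventDateTime", StringType)
  .add("EventDispatcherID", StringType)
  .add("ID", StringType)
  .add("Notes", StringType)
  .add("PhoneNumberCalled", ArrayType(StringType))
  .add("SubCallNumber", StringType)
  .add("SupItemNumber", StringType)
  .add("Type", StringType)
  .add("UnitID", StringType)

val incidentMessageSchema = new StructType()
  .add("AssignedUnitEvent", ArrayType(assignedUnitEvent))

// Re-parse the string column that still carries the JSON text;
// malformed rows become a null struct and can be routed elsewhere
val recovered = finalResult
  .withColumn("incidentMessage", from_json($"incidentMessage", incidentMessageSchema))
val stillBroken = recovered.filter($"incidentMessage".isNull)
The preamble column can be recovered the same way with its own StructType built from the tree above; only the genuinely variable parts of the event messages would need per-shape handling.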