The stream coming from Kafka mixes messages with different schemas, as shown below:
{
  "header": {
    "batch_id": "CustomerService_0_667_742",
    "entity": "ActionItem",
    "time": 1536419113,
    "key": [
      {
        "actionItemKey": "\"536870923\""
      }
    ],
    "message_type": "transmessage"
  },
  "body": {
    "actionItemKey": "536870923",
    "actionItemSourceId": "536870923",
    "taskId": "1807271",
    "actionItemTitle": "test",
    "activeFlag": "1",
    "startDate": "2018-07-27T07:44:57Z",
    "dueDate": "2018-08-03T07:44:57Z",
    "completionDate": "1753-01-01T05:50:36Z",
    "originatorEmployeeKey": "10001",
    "ownerEmployeeKey": "10001",
    "actionItemTypeKey": "288",
    "actionItemStatusKey": "32",
    "actionItemPriorityKey": "296",
    "customerServiceActivityStateKey": "Not Started",
    "dml_action": "U",
    "source_update_time__": "2018-09-08T15:05:13Z",
    "source_query_time__": "2018-09-08T15:05:13Z",
    "sourceSystemId": ""
  }
}
{
  "header": {
    "batch_id": "Invoice_0_39550_48481",
    "entity": "TaxRate",
    "time": 1536419007,
    "key": [
      {
        "taxRateKey": "\"1\""
      }
    ],
    "message_type": "refmessage"
  },
  "body": {
    "taxCodeKey": "TX1",
    "taxRate": 5.0000,
    "taxRateKey": "1",
    "taxRuleCode": "R1",
    "taxAuthorityCode": "COUNTRY",
    "taxTypeId": "VAT",
    "effectiveDate": "2000-01-01T06:00:00Z",
    "taxRateId": "1",
    "dml_action": "U",
    "source_update_time__": "2018-09-08T15:03:27Z",
    "source_query_time__": "2018-09-08T15:03:27Z",
    "sourceSystemId": ""
  }
}
We have more than 200 tables with different schemas, and I don't want to declare a schema in Spark for each one. I want to use Spark Structured Streaming to save these tables to HDFS in JSON format, partitioned by entity name and current_date. Here is the snippet in use:
import spark.implicits._  // needed for the .as[String] conversion below

val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", DevConfig.BrokerHosts)
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(value AS STRING)")  // keep only the message payload as a string
  .as[String]
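The write side is not shown, but since the goal is JSON output it presumably uses the json file sink on this single string column, along the lines of the sketch below (the output path and checkpoint location are placeholders):

val query = lines
  .writeStream
  .format("json")
  .option("path", "/data/out")                       // hypothetical HDFS path
  .option("checkpointLocation", "/tmp/checkpoints")  // hypothetical
  .start()

Written this way, each output record looks like {"value":"{\"header\":..."}: the json sink serializes the string column as a JSON field, escaping the original quotes.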
After reading the values as strings and storing them to HDFS, each value ends up wrapped in double quotes, with the inner quotes escaped.
How can we store the JSON in the same format it has in Kafka?
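One possible approach, sketched under assumptions (partition column names entity and date, placeholder paths): extract the partition values from the raw string with get_json_object, so no per-table schema is required, then write with the text sink, which emits the single remaining string column verbatim and so preserves the original JSON:

import org.apache.spark.sql.functions.{col, current_date, get_json_object}

// Pull the entity name out of the header without declaring a full schema.
val withPartitions = lines
  .withColumn("entity", get_json_object(col("value"), "$.header.entity"))
  .withColumn("date", current_date())

// partitionBy removes entity and date from the written data, leaving only the
// raw "value" column, which the text sink writes line by line without escaping.
val query = withPartitions
  .writeStream
  .format("text")
  .option("path", "/data/out")                       // hypothetical HDFS path
  .option("checkpointLocation", "/tmp/checkpoints")  // hypothetical
  .partitionBy("entity", "date")
  .start()

The trade-off is that the part files carry a .txt extension rather than .json, but their content is the original Kafka payload, one message per line.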