Spark SQL streaming of JSON data with Kafka: from_json cannot parse multi-line JSON from a Kafka topic

Date: 2019-01-22 05:24:53

Tags: apache-spark apache-kafka spark-structured-streaming spark-streaming-kafka

Here I am sending JSON data to the Kafka topic "test", applying a schema to the JSON, doing some transformations, and printing the result on the console. Here is the code:

val kafkadata = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("zookeeper.connect", "localhost:2181")
    .option("subscribe", "test")
    .option("startingOffsets", "earliest")
    .option("max.poll.records", 10)
    .option("failOnDataLoss", false)
    .load()
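
For completeness, a sketch of the SparkSession and imports the snippets in this question presumably rely on (they are not shown in the post; the app name below is just a placeholder):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{DoubleType, LongType, MapType, StringType, StructType}

// Placeholder session; any existing SparkSession works here.
val spark = SparkSession.builder().appName("kafka-json-streaming").getOrCreate()
import spark.implicits._ // enables the $"column" syntax used below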



val schema1 = new StructType()
  .add("id_sales_order", StringType)
  .add("item_collection",
    MapType(
      StringType,
      new StructType()
        .add("id", LongType)
        .add("ip", StringType)
        .add("description", StringType)
        .add("temp", LongType)
        .add("c02_level", LongType)
        .add("geo",
          new StructType()
            .add("lat", DoubleType)
            .add("long", DoubleType)
        )
    )
  )



val df = kafkadata
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema = schema1).as("data"))
  .select($"data.id_sales_order", explode($"data.item_collection"))




 val query = df.writeStream
    .outputMode("append")
    .queryName("table")
    .format("console")
    .start()
  query.awaitTermination()
  spark.stop()
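
Not part of the original post, but a quick diagnostic sketch (reusing the kafkadata stream defined above): write the raw value column to the console before from_json is applied, to see whether a multi-line document arrives as one Kafka record or as one record per line.

val rawQuery = kafkadata
  .selectExpr("CAST(value AS STRING) AS raw_json") // raw Kafka payload, no JSON parsing yet
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()

rawQuery.awaitTermination()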

I am sending data to Kafka in two ways:

1) Single-line JSON:

 {"id_sales_order": "2", "item_collection": {"2": {"id": 10,"ip": "68.28.91.22","description": "Sensor attached to the container ceilings","temp":35,"c02_level": 1475,"geo": { "lat":38.00, "long":97.00}}}}

It gives me this output:
+--------------+---+--------------------+
|id_sales_order|key|               value|
+--------------+---+--------------------+
|             2|  2|[10,68.28.91.22,S...|
+--------------+---+--------------------+

2) Multi-line JSON:

{
  "id_sales_order": "2",
  "item_collection": {
    "2": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp":35,
      "c02_level": 1475,
      "geo":
        { "lat":38.00, "long":97.00}
    }
}
}

It is not giving me any output.
+--------------+---+-----+
|id_sales_order|key|value|
+--------------+---+-----+
+--------------+---+-----+

The JSON coming from the source looks like the second one.

How should JSON be handled while reading streaming data from Kafka? I think the problem may be that the from_json function cannot understand multi-line JSON.
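
One way to narrow this down (a batch sketch, not from the original post): hand the whole multi-line document to from_json as a single string. If this prints a populated row, from_json itself copes with the embedded newlines, and the more likely cause is that the producer delivered each line of the document as a separate Kafka record, none of which is valid JSON on its own.

val multiLineJson =
  """{
    |  "id_sales_order": "2",
    |  "item_collection": {
    |    "2": {
    |      "id": 10,
    |      "ip": "68.28.91.22",
    |      "description": "Sensor attached to the container ceilings",
    |      "temp": 35,
    |      "c02_level": 1475,
    |      "geo": { "lat": 38.00, "long": 97.00 }
    |    }
    |  }
    |}""".stripMargin

// Same parsing logic as the streaming query, but on a single in-memory string.
Seq(multiLineJson).toDF("json")
  .select(from_json($"json", schema = schema1).as("data"))
  .select($"data.id_sales_order", explode($"data.item_collection"))
  .show(false)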

0 Answers:

There are no answers