Unable to query JSON structured data with Spark and Kafka

Date: 2019-05-24 04:30:13

Tags: apache-spark apache-kafka apache-spark-sql spark-streaming spark-structured-streaming

Hi all,

I am processing JSON streaming data with Spark 2.4.0 and Kafka 2.2 (both built against Scala 2.11). I followed these examples:

example1

example2

The data in my Kafka topic (randomly generated) is JSON, one record per message:

{"yaw": 0, "height": 2053.800174349187, "timestamp": "1555465965", "v": 1, "longitude": "121.645261", "jet_number": 15, "acc": 1, "latitude": "30.050000"}
{"yaw": 0, "height": 2023.4573189529592, "timestamp": "1555465966", "v": 1, "longitude": "87.656227", "jet_number": 11, "acc": 1, "latitude": "30.050000"}
{"yaw": 0, "height": 2005.5774022979028, "timestamp": "1555465967", "v": 1, "longitude": "124.613970", "jet_number": 3, "acc": 1, "latitude": "30.050000"}
{"yaw": 0, "height": 2074.936351669867, "timestamp": "1555465968", "v": 1, "longitude": "131.765794", "jet_number": 15, "acc": 1, "latitude": "30.050000"}
{"yaw": 0, "height": 2030.5305980070775, "timestamp": "1555465969", "v": 1, "longitude": "126.936592", "jet_number": 12, "acc": 1, "latitude": "30.050000"}
{"yaw": 0, "height": 2024.540075254924, "timestamp": "1555465970", "v": 1, "longitude": "121.432735", "jet_number": 12, "acc": 1, "latitude": "30.050000"}

My code snippet:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DataTypes, StructType}
import spark.implicits._  // for the $"..." column syntax

val schema = new StructType()
             .add("acc", DataTypes.IntegerType)
             .add("v", DataTypes.IntegerType)
             .add("longitude", DataTypes.StringType)
             .add("jet_number", DataTypes.IntegerType)
             .add("timestamp", DataTypes.StringType)
             .add("latitude", DataTypes.StringType)
             .add("height", DataTypes.IntegerType)
             .add("yaw", DataTypes.IntegerType)

val df = spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "ip:9092")
              .option("kafka.partition.assignment.strategy","org.apache.kafka.clients.consumer.RangeAssignor")
              .option("subscribe", "test")
              .load()
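              // Note: startingOffsets defaults to "latest" for streaming queries,
              // so only records produced after the query starts are read;
              // .option("startingOffsets", "earliest") would replay the topic from
              // the beginning (which is likely why Batch 0 below is empty).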

df.printSchema

val jetDF = df.selectExpr("CAST(value AS STRING)")

jetDF.printSchema

val jdf = jetDF.select(from_json($"value", schema).as("data")).select("data.*")
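// Note: from_json returns a null row whenever the string cannot be parsed
// against the supplied schema (its default PERMISSIVE behaviour).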

jdf.printSchema

jdf.writeStream
    .outputMode("append")
    .format("console")
    .start()
    .awaitTermination()
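For debugging, a variant that prints the raw value strings before from_json is applied (a sketch; the console sink's truncate option is disabled so the full JSON is visible):

jetDF.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()
    .awaitTermination()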

After running the original code above, I don't get the correct output:

Spark context available as 'sc' (master = local[*], app id = local-1558671649422).
Spark session available as 'spark'.
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

root
 |-- value: string (nullable = true)

root
 |-- acc: integer (nullable = true)
 |-- v: integer (nullable = true)
 |-- longitude: string (nullable = true)
 |-- jet_number: integer (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- height: integer (nullable = true)
 |-- yaw: integer (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+---------+----------+---------+--------+------+---+
|acc|  v|longitude|jet_number|timestamp|latitude|height|yaw|
+---+---+---------+----------+---------+--------+------+---+
+---+---+---------+----------+---------+--------+------+---+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+----+----+---------+----------+---------+--------+------+----+
| acc|   v|longitude|jet_number|timestamp|latitude|height| yaw|
+----+----+---------+----------+---------+--------+------+----+
|null|null|     null|      null|     null|    null|  null|null|
|null|null|     null|      null|     null|    null|  null|null|
|null|null|     null|      null|     null|    null|  null|null|
|null|null|     null|      null|     null|    null|  null|null|
+----+----+---------+----------+---------+--------+------+----+
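One guess I have not verified: the height values in my sample are floating-point numbers, but the schema declares height as IntegerType, so every row may fail to parse and come back null (see the note on from_json above). A schema variant worth trying, with only the height line changed:

val schema = new StructType()
             .add("acc", DataTypes.IntegerType)
             .add("v", DataTypes.IntegerType)
             .add("longitude", DataTypes.StringType)
             .add("jet_number", DataTypes.IntegerType)
             .add("timestamp", DataTypes.StringType)
             .add("latitude", DataTypes.StringType)
             .add("height", DataTypes.DoubleType)  // was IntegerType; sample heights are doubles
             .add("yaw", DataTypes.IntegerType)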

Can anyone help me? I have been struggling with this for two days.

:(

0 Answers:

No answers yet.