Structured Streaming truncates Kafka timestamps to seconds

Time: 2018-10-05 16:59:46

Tags: apache-spark spark-structured-streaming

I am using Spark Structured Streaming to read from Kafka, and I want to include the Kafka timestamp with each message:

sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker:10000")
  .option("subscribe", "topicname")
  .option("includeTimestamp", true)
  .load()
  .selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
  .as[(String, String, String, Long)]

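When I look at the timestamp, it has been truncated from milliseconds to seconds. Is there a way to recover millisecond precision after reading?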

2 answers:

Answer 0 (score: 1)

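The truncation happens when the timestamp is read as a Long value: Spark's cast from a timestamp to a long yields whole seconds since the epoch, so the millisecond part is dropped. This happens on the last line of the following: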

sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker:10000")
  .option("subscribe", "topicname")
  .option("includeTimestamp", true)
  .load()
  .selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
  .as[(String, String, String, Long)]

Change the last line to the following (with java.sql.Timestamp imported):

.as[(String, String, String, Timestamp)]
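
If a Long is still needed downstream, the milliseconds can be recovered from the Timestamp itself rather than through a cast. The following is a minimal sketch in plain Scala, with no Spark dependency. The object name TimestampPrecision and the sample epoch value are made up for illustration; only java.sql.Timestamp and its getTime method come from the JDK.

```scala
import java.sql.Timestamp

object TimestampPrecision {
  // Epoch milliseconds for an illustrative Kafka record (made-up sample value).
  val kafkaMillis: Long = 1538882390677L

  // What a java.sql.Timestamp-typed column would hand back for that record.
  val ts: Timestamp = new Timestamp(kafkaMillis)

  // Whole seconds since the epoch: the fractional part is lost,
  // which mirrors what reading the column as a Long produces.
  val asSeconds: Long = ts.getTime / 1000L

  // Full precision, read from the Timestamp itself.
  val asMillis: Long = ts.getTime

  def main(args: Array[String]): Unit = {
    println(s"seconds: $asSeconds")
    println(s"millis:  $asMillis")
  }
}
```

On a Dataset typed with Timestamp in the last position, the same getTime call could be applied in a map step, for example `.map { case (topic, key, value, ts) => (topic, key, value, ts.getTime) }`, assuming spark.implicits._ is in scope.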

Answer 1 (score: 0)

I just tried this quickly in IntelliJ with a local Kafka setup.

If by truncation you mean the three dots at the end of the timestamp field (as in the output below):

Batch: 1
-------------------------------------------
+-----+----+--------+--------------------+
|topic| key|   value|           timestamp|
+-----+----+--------+--------------------+
| test|null|test-123|2018-10-07 03:10:...|
| test|null|test-234|2018-10-07 03:10:...|
+-----+----+--------+--------------------+

then you only need to add the following line:

.option("truncate", false)

to your writeStream() section, for example:

Dataset<Row> df = sparkSession
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test")
                .option("includeTimestamp", "true")
                .load()
                .selectExpr("CAST(topic AS STRING)", "CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp as STRING)");

try {
    df.writeStream()
          .outputMode("append")
          .format("console")
          .option("truncate", false)
          .start()
          .awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
}

With this change, the output shows the full timestamp:

Batch: 1
-------------------------------------------
+-----+----+--------+-----------------------+
|topic|key |value   |timestamp              |
+-----+----+--------+-----------------------+
|test |null|test-123|2018-10-07 03:19:50.677|
|test |null|test-234|2018-10-07 03:19:52.673|
+-----+----+--------+-----------------------+
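
If the timestamp is carried as a string, as in the CAST(timestamp as STRING) above, the millisecond part can still be read back later. This is a rough sketch in plain Scala, with the sample value taken from the output above. The object name ParseTimestampString is hypothetical. It also assumes the string is interpreted in the JVM's default time zone, which may differ from the Spark session time zone used to print it.

```scala
import java.sql.Timestamp

object ParseTimestampString {
  // Sample value copied from the console output above.
  val printed: String = "2018-10-07 03:19:50.677"

  // Timestamp.valueOf accepts the JDBC escape format yyyy-[m]m-[d]d hh:mm:ss[.f...].
  val parsed: Timestamp = Timestamp.valueOf(printed)

  // Fractional second expressed in milliseconds; this does not depend on the time zone.
  val millisPart: Int = parsed.getNanos / 1000000

  def main(args: Array[String]): Unit = {
    println(s"millisecond part: $millisPart")
    println(s"round trip: ${parsed.toString}")
  }
}
```

Keeping the column as a Timestamp avoids this round trip entirely, so the string form is probably best kept for display only.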

I hope this helps.