I have a PySpark Structured Streaming Python application set up like this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("data streaming app") \
    .getOrCreate()

data_raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafkahost:9092") \
    .option("subscribe", "my_topic") \
    .load()

query = data_raw.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .trigger(processingTime="5 seconds") \
    .start()

query.awaitTermination()
All that gets displayed is this:
+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+
19/03/04 22:00:50 INFO streaming.StreamExecution: Streaming query made progress: {
  "id" : "ab24bd30-6e2d-4c2a-92a2-ddad66906a5b",
  "runId" : "29592d76-892c-4b29-bcda-f4ef02aa1390",
  "name" : null,
  "timestamp" : "2019-03-04T22:00:49.389Z",
  "numInputRows" : 0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 852,
    "getBatch" : 180,
    "getOffset" : 135,
    "queryPlanning" : 107,
    "triggerExecution" : 1321,
    "walCommit" : 27
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[my_topic]]",
    "startOffset" : null,
    "endOffset" : {
      "my_topic" : {
        "0" : 303
      }
    },
    "numInputRows" : 0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@74fad4a5"
  }
}
As you can see, there are 303 messages in my_topic, but I can't get them to display. Some additional context: I'm using the Confluent Kafka JDBC connector to query an Oracle database and store the rows in a Kafka topic, and I have an Avro schema registry set up. I can share those properties files as well if needed.
Does anyone know what's going on?
Answer 0 (score: 3)
Being a streaming application, this Spark Structured Streaming job only reads messages as soon as they are published. For testing purposes, what I wanted was to read everything already in the topic. To do that, all you have to do is add one option to readStream:
    .option("startingOffsets", "earliest")
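Put together, here is a minimal sketch of the pipeline above with that option applied. The CAST(value AS STRING) select is an extra illustration, not part of the original fix: the Kafka source exposes key and value as binary, and an Avro-encoded payload will still look like raw bytes without a schema-registry deserializer.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("data streaming app") \
    .getOrCreate()

# startingOffsets defaults to "latest", which only picks up messages
# published after the query starts; "earliest" replays the whole topic.
data_raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafkahost:9092") \
    .option("subscribe", "my_topic") \
    .option("startingOffsets", "earliest") \
    .load()

# Cast the binary value column so the console sink prints something
# readable (illustrative only; Avro payloads need real deserialization).
query = data_raw.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .trigger(processingTime="5 seconds") \
    .start()

query.awaitTermination()

Note that "earliest" only affects where a brand-new query starts; once the query has a checkpoint, it resumes from the committed offsets instead.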