I am following the code from Spark: The Definitive Guide. I have run into a problem: the code below does not print results in Jupyter Notebook when the line "awaitTermination()" is included (shown commented out below). With "awaitTermination()" in the code, the Jupyter kernel is busy and may stay busy for a long time.
Without "awaitTermination()", the code works fine.
Can someone explain this behavior? How can I work around it?
static = spark.read.json(r"/resources/activity-data/")
dataSchema = static.schema
streaming = (spark
    .readStream
    .schema(dataSchema)
    .option("maxFilesPerTrigger", 1)
    .json(r"/resources/activity-data/")
)
activityCounts = streaming.groupBy("gt").count()
spark.conf.set("spark.sql.shuffle.partitions", 5)
activityQuery = (activityCounts
    .writeStream
    .queryName("activity_counts")
    .format("memory")
    .outputMode("complete")
    .start()
)
#activityQuery.awaitTermination()
#activityQuery.stop()
from time import sleep
for x in range(5):
    spark.table("activity_counts").show()
    sleep(1)
Answer 0 (score: 0)
Yes; see this documentation as a reference (https://docs.databricks.com/spark/latest/structured-streaming/production.html); page 352 of Spark TDG also covers this.
Spark streaming jobs are continuous applications, and in production activityQuery.awaitTermination() is required because it prevents the driver process from terminating while the stream is active (in the background).
If the driver is killed, the application is killed with it, so activityQuery.awaitTermination() is a kind of failsafe. If you want to shut the stream down in Jupyter, you can run activityQuery.stop() to reset the query for testing... hope this helps.
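As an aside (not from the book, just a sketch): in a notebook you can pass a timeout in seconds to awaitTermination, so the cell blocks only briefly and then returns control to the kernel, while the query keeps running in the background:
# Sketch, reusing the activityQuery object from the question.
# awaitTermination(timeout) blocks for at most `timeout` seconds and
# returns True if the query terminated, False if it is still running.
finished = activityQuery.awaitTermination(timeout=10)
if not finished:
    print("query still running in the background:", activityQuery.isActive)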
activityDataSample = 'path/to/data'
spark.conf.set("spark.sql.shuffle.partitions", 8)
static = spark.read.json(activityDataSample)
dataSchema = static.schema
static.printSchema()
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1)\
    .json(activityDataSample)
activityCounts = streaming.groupBy("gt").count()
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
    .format("memory").outputMode("complete")\
    .start()
# simulates a continuous stream for testing (Ctrl-C to kill the app)
'''
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("console").outputMode("complete")\
.start()
activityQuery.awaitTermination()
'''
spark.streams.active # query stream is active
[<pyspark.sql.streaming.StreamingQuery at 0x28a4308d320>]
from time import sleep
for x in range(3):
    spark.sql("select * from activity_counts").show(3)
    sleep(2)
+---+-----+
| gt|count|
+---+-----+
+---+-----+
+--------+-----+
| gt|count|
+--------+-----+
| bike|10796|
| null|10449|
|stairsup|10452|
+--------+-----+
only showing top 3 rows
+--------+-----+
| gt|count|
+--------+-----+
| bike|10796|
| null|10449|
|stairsup|10452|
+--------+-----+
only showing top 3 rows
activityQuery.stop() # stop query stream
spark.streams.active # no active streams anymore
[]
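One more note (not from the book): while a stream is still active, you can also inspect the query object directly instead of sleeping blindly. A quick sketch, assuming the same activityQuery as above and run before activityQuery.stop():
print(activityQuery.isActive)      # True while the stream is running
print(activityQuery.status)        # dict with the current trigger state / whether data is available
print(activityQuery.lastProgress)  # dict with metrics of the latest micro-batch (None before the first one)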