I'm writing watermark-based code with PySpark Structured Streaming. Everything works, but after I send some data from the source, I also get an extra, empty DataFrame in the following batch.
I have removed a few df.printSchema() statements from the code below; the result is the same with or without them.
Here is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, window
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

socketStreamDF = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9991) \
    .load()

# Parse each CSV line: epoch seconds -> timestamp, then symbol and numeric value
stocksDF = socketStreamDF.withColumn("value", split("value", ",")) \
    .withColumn("time", col("value")[0].cast("long").cast("timestamp")) \
    .withColumn("symbol", col("value")[1]) \
    .withColumn("value", col("value")[2].cast(DoubleType()))
# Sum values per symbol over 10-second tumbling windows, with a 5-second watermark
windowedWords = stocksDF \
    .withWatermark("time", "5 seconds") \
    .groupBy(window("time", "10 seconds"), stocksDF.symbol) \
    .sum("value")

# Write every updated window to the console
query = windowedWords \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()
query.awaitTermination()
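To reproduce this, something has to be listening on port 9991 before the query starts, since the socket source fails if it cannot connect. Below is a minimal test-driver sketch that pushes the single record shown under "Input"; it is illustrative only (not necessarily how the data was sent originally), and running nc -lk 9991 and typing the line by hand works equally well:

import socket
import time

# Illustrative test driver: listen on the port the socket source reads from
# and push one CSV record once Spark connects.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9991))
server.listen(1)
conn, _ = server.accept()                  # Spark connects when the query starts
conn.sendall(b'1509672910,"aapl",500.0\n')
time.sleep(5)                              # keep the connection open long enough
conn.close()
server.close()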
Input

1509672910,"aapl",500.0

Expected output

Batch 1
Window                                      Symbol  sum(value)
[2017-11-03 07:05:10, 2017-11-03 07:05:20]  aapl    500.0

Actual output

Batch 1
Window                                      Symbol  sum(value)
[2017-11-03 07:05:10, 2017-11-03 07:05:20]  aapl    500.0

Batch 2
Window                                      Symbol  sum(value)
(empty)
Update 1

I tried storing the reference to the new socketStreamDF (with the modified value, time, and symbol columns) in the same variable, as shown below, but it still doesn't work:

socketStreamDF = socketStreamDF.withColumn("value", split("value", ",")) \
    .withColumn("time", col("value")[0].cast("long").cast("timestamp")) \
    .withColumn("symbol", col("value")[1]) \
    .withColumn("value", col("value")[2].cast(DoubleType()))