I am writing a Spark Structured Streaming program. I have previous experience with Spark batch processing and Spark Streaming, but in the Structured Streaming case I am running into some differences.
To reproduce my problem, here is a code snippet. The code reads the file data.json, which is stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

# These two calls are where the program fails: collect() on a streaming DataFrame.
times = ds.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds.select("id").rdd.flatMap(lambda x: x).collect()
# do other operations with "times" and "ids"

df_persons = ds \
    .filter(func.col("type") == "person") \
    .drop("type")

query = df_persons \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
For each micro-batch, I have to retrieve times and ids so that I can run global operations on them. However, this code fails because I call collect() on ds:
pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[data/]'
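For comparison, the equivalent batch version of those two lines (a minimal sketch, not part of the streaming program above; it reuses the same spark session and schema) does not raise this error, since collect() is allowed on a static DataFrame:

# Batch-mode sketch for comparison only.
ds_batch = spark.read.schema(schema).json("data/")
times = ds_batch.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds_batch.select("id").rdd.flatMap(lambda x: x).collect()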
I tried adding writeStream.start().awaitTermination() to the times query, but that does not solve the problem.
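To make the intent concrete, this is roughly the kind of per-micro-batch access I am after (a sketch only; it assumes foreachBatch from newer Spark releases, and I have not verified that it fits my case):

def process_batch(batch_df, epoch_id):
    # Inside foreachBatch each micro-batch arrives as a normal static DataFrame,
    # so collect() would be permitted here.
    times = [row["timestamp"] for row in batch_df.select("timestamp").collect()]
    ids = [row["id"] for row in batch_df.select("id").collect()]
    # do other operations with "times" and "ids"

query = ds.writeStream.foreachBatch(process_batch).start()
query.awaitTermination()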