Queries with streaming sources must be executed with writeStream.start()

Time: 2018-11-23 15:21:22

Tags: python apache-spark pyspark spark-structured-streaming

I am writing a Spark Structured Streaming program. I have prior experience with Spark batch processing and Spark Streaming, but in the Structured Streaming case I am running into some differences.

To reproduce my issue, I provide a code snippet. The code uses a data.json file stored in the data folder:

[
  {"id": 77,"type": "person","timestamp": 1532609003},
  {"id": 77,"type": "person","timestamp": 1532609005},
  {"id": 78,"type": "crane","timestamp": 1532609005}
]
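As a side note: if I understand the Spark docs correctly, the JSON file source expects line-delimited JSON (one object per line) by default, so the array form above would need the multiLine option; the same records in JSON Lines layout would be:

{"id": 77, "type": "person", "timestamp": 1532609003}
{"id": 77, "type": "person", "timestamp": 1532609005}
{"id": 78, "type": "crane", "timestamp": 1532609005}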

The code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType
import pyspark.sql.functions as func

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

# collect() on a streaming DataFrame is what triggers the AnalysisException below
times = ds.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds.select("id").rdd.flatMap(lambda x: x).collect()
# do other operations with "times" and "ids"

df_persons = ds\
              .filter(func.col("type") == "person") \
              .drop("type")

query = df_persons \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()

For each micro-batch, I have to retrieve times and ids so that I can run global operations on them. But this code fails, because I call collect() on ds:

pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[data/]'

I tried adding writeStream.start().awaitTermination() to times, but that does not solve the problem.
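From reading the docs, foreachBatch (added in Spark 2.4) looks like it could be a way out: the callback receives every micro-batch as a plain, non-streaming DataFrame, on which collect() is legal. A minimal sketch of what I have in mind, assuming Spark >= 2.4 (process_batch is my own placeholder name):

def process_batch(batch_df, batch_id):
    # batch_df is a static DataFrame holding only this micro-batch,
    # so actions such as collect() are allowed here
    times = batch_df.select("timestamp").rdd.flatMap(lambda x: x).collect()
    ids = batch_df.select("id").rdd.flatMap(lambda x: x).collect()
    # do other operations with "times" and "ids"

query = df_persons \
    .writeStream \
    .foreachBatch(process_batch) \
    .start()

query.awaitTermination()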

0 Answers:

No answers yet