I am writing a Spark Structured Streaming program. I have previous experience with Spark batch processing and Spark Streaming, but in the Structured Streaming case I am running into some differences.
To reproduce my problem, here is a code snippet. The code reads the file data.json, which is stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

# These two calls are where the program fails: collect() on a streaming DataFrame.
times = ds.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds.select("id").rdd.flatMap(lambda x: x).collect()
# do other operations with "times" and "ids"

df_persons = ds \
    .filter(func.col("type") == "person") \
    .drop("type")

query = df_persons \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
For each micro-batch, I have to retrieve times and ids so that I can run global operations on them. However, this code fails because I call collect() on ds:
pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[data/]'
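For comparison, the equivalent batch version of those two lines (a minimal sketch, not part of the streaming program above; it reuses the same spark session and schema) does not raise this error, since collect() is allowed on a static DataFrame:

# Batch-mode sketch for comparison only.
ds_batch = spark.read.schema(schema).json("data/")
times = ds_batch.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds_batch.select("id").rdd.flatMap(lambda x: x).collect()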
I tried adding writeStream.start().awaitTermination() to the times query, but that does not solve the problem.
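To make the intent concrete, this is roughly the kind of per-micro-batch access I am after (a sketch only; it assumes foreachBatch from newer Spark releases, and I have not verified that it fits my case):

def process_batch(batch_df, epoch_id):
    # Inside foreachBatch each micro-batch arrives as a normal static DataFrame,
    # so collect() would be permitted here.
    times = [row["timestamp"] for row in batch_df.select("timestamp").collect()]
    ids = [row["id"] for row in batch_df.select("id").collect()]
    # do other operations with "times" and "ids"

query = ds.writeStream.foreachBatch(process_batch).start()
query.awaitTermination()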