Question

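I have just started using Spark Structured Streaming, so I am still experimenting with it. I am aggregating my data, but I cannot write the result out as a CSV file. I have tried the combinations below, and none of them has written the output.

My sample data is: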

colum,values
A,12
A,233
B,232
A,67
B,5
A,89
A,100

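I read it as a streaming DataFrame:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for the two CSV columns
userSchema = StructType([
    StructField("colum", StringType()),
    StructField("values", IntegerType())
])

# Read the input directory as a streaming DataFrame
line2 = spark \
    .readStream \
    .format('csv') \
    .schema(userSchema) \
    .csv("/data/location")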

I then compute an aggregation:

save = line2.groupBy("colum").count()

The expected output is:

+-----+-----+
|colum|count|
+-----+-----+
|B    |2    |
|A    |5    |
|colum|1    |
+-----+-----+

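Scenario 1:

save.writeStream.format("csv").queryName("a").outputMode("append").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Error: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

Note: The data contains no timestamp, so I cannot add a watermark.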
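For context, the output modes differ in what each micro-batch emits. The following is a plain-Python sketch, not Spark code: `run_batches` and the split of the sample `colum` values into three batches are assumptions made purely for illustration. It shows that a running count keeps changing as data arrives, which is why append mode (which only emits rows that will never change again) needs a watermark to decide when a count is final.

```python
from collections import Counter

def run_batches(batches):
    """Yield (complete, update) views of a running count after each micro-batch."""
    totals = Counter()
    for batch in batches:
        batch_counts = Counter(batch)
        totals.update(batch_counts)
        # "complete": the whole result table after this batch
        complete = dict(totals)
        # "update": only the rows whose count changed in this batch
        update = {key: totals[key] for key in batch_counts}
        yield complete, update

# Sample 'colum' values (without the header row), split into three hypothetical batches
batches = [["A", "A", "B"], ["A", "B"], ["A", "A"]]

for complete, update in run_batches(batches):
    print("complete:", complete, "| update:", update)
```

In this sketch the count for A changes in every batch, so no row could safely be appended as final at any point.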

Scenario 2:

save.writeStream.format("csv").queryName("a").outputMode("complete").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Error: org.apache.spark.sql.AnalysisException: Data source csv does not support Complete output mode;

Scenario 3:

save.writeStream.format("csv").queryName("a").outputMode("update").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Error: org.apache.spark.sql.AnalysisException: Data source csv does not support Update output mode;

Scenario 4:

save.writeStream.format("parquet").queryName("a").outputMode("update").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Error: org.apache.spark.sql.AnalysisException: Data source parquet does not support Update output mode;
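The errors so far suggest that the file sinks (csv, parquet) only accept append mode. One direction that might get around this is `foreachBatch`, which hands each micro-batch to an ordinary function as a regular DataFrame. The sketch below is untested against my setup and assumes Spark 2.4 or later; `write_batch` and `start_csv_query` are hypothetical helper names, and the paths are the ones used in the scenarios above.

```python
OUTPUT_PATH = "/xyz/saveloc"        # output directory from the scenarios above
CHECKPOINT_PATH = "/xyz/chkptloc"   # checkpoint directory from the scenarios above

def write_batch(batch_df, batch_id):
    """Overwrite the CSV directory with the result table of this micro-batch.

    In complete mode, batch_df holds the full aggregated table each time,
    so overwriting keeps only the latest counts on disk.
    """
    batch_df.write.mode("overwrite").option("header", "true").csv(OUTPUT_PATH)

def start_csv_query(aggregated_df):
    """Start a streaming query that routes every micro-batch through write_batch."""
    return (aggregated_df.writeStream
            .outputMode("complete")
            .foreachBatch(write_batch)
            .option("checkpointLocation", CHECKPOINT_PATH)
            .start())

# Assumed usage with the aggregated DataFrame from above:
# query = start_csv_query(save)
# query.awaitTermination()
```

Because the batch write goes through the normal DataFrame writer, the streaming restriction on output modes for file sinks should not apply to it, at the cost of rewriting the whole directory on every trigger.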

Scenario 5:

save.writeStream.format("console").queryName("a").outputMode("complete").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Comment: No output is generated at that location.

Scenario 6:

save.writeStream.format("memory").queryName("a").outputMode("complete").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()

Comment: No output is generated.

Scenario 7:

save.writeStream.format("memory").queryName("a").outputMode("update").option("path", "/xyz/saveloc").option("checkpointLocation", "/xyz/chkptloc").start()