Structured Streaming creates no files other than _spark_metadata (Parquet)

Date: 2017-09-20 11:32:02

Tags: apache-spark google-cloud-storage spark-streaming google-cloud-dataproc spark-structured-streaming

I am trying to run a Structured Streaming application that writes its output as Parquet files to Google Cloud Storage. I don't see any errors, but no files are written to the GCS location; all I can see is the _spark_metadata folder. Any idea how to debug this?

String windowDuration = "60 minutes";
String slideDuration = "10 minutes";

Dataset<Row> data_2 = complete_data;

// Convert the epoch-millis event timestamp into a proper timestamp column
data_2 = data_2.withColumn("creationDt",
        functions.to_timestamp(functions.from_unixtime(
                col(topics + "." + event_timestamp).divide(1000.0))));

// Sliding-window count with a 1-minute watermark
data_2 = data_2
        .withWatermark("creationDt", "1 minute")
        .groupBy(col(topics + "." + keyField),
                functions.window(col("creationDt"), windowDuration, slideDuration),
                col(topics + "." + aggregateByField))
        .count();

StreamingQuery query_2 = data_2
        .withColumn("startwindow", col("window.start"))
        .withColumn("endwindow", col("window.end"))
        .withColumn("endwindow_date", col("window.end").cast(DataTypes.DateType))
        .writeStream()
        .format("parquet")
        .partitionBy("endwindow_date")
        .option("path", dataFile_2)
        .option("truncate", "false")
        .outputMode("append")
        .option("checkpointLocation", checkpointFile_2)
        .start();

query_2.awaitTermination();
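
To see whether the query is actually making progress or has died with a swallowed error, one option is to register a StreamingQueryListener before calling start(). A minimal sketch, assuming the SparkSession is named spark (not shown in the question):

import org.apache.spark.sql.streaming.StreamingQueryListener;

// Log every trigger's progress and the termination cause, so a failure
// cannot be silently lost.
spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent event) {
        System.out.println("Query started: " + event.id());
    }

    @Override
    public void onQueryProgress(QueryProgressEvent event) {
        // numInputRows stuck at 0 means no data reaches the aggregation;
        // non-zero input with no output files points at the sink instead.
        System.out.println(event.progress().prettyJson());
    }

    @Override
    public void onQueryTerminated(QueryTerminatedEvent event) {
        // exception() is non-empty when the query died with an error
        System.out.println("Query terminated, exception = " + event.exception());
    }
});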

1 Answer:

Answer 0 (score: 0)

I think the problem is the .outputMode("append") line. GCS is an object store, not a file system, and it does not support append.

My guess is that this line blows up and the exception just gets swallowed somewhere: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/fs/gcs/GoogleHadoopFileSystemBase.java#L1175
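
One quick way to check whether an exception really is being swallowed: awaitTermination() rethrows the failure cause as a StreamingQueryException, and StreamingQuery.exception() exposes it without blocking. A sketch reusing the question's query_2 handle:

import org.apache.spark.sql.streaming.StreamingQueryException;

try {
    query_2.awaitTermination();
} catch (StreamingQueryException e) {
    // If the GCS connector is indeed rejecting the write, the root cause
    // should surface here instead of being lost.
    e.printStackTrace();
}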