How to save streaming aggregations to Parquet in complete output mode?

Date: 2017-09-26 07:46:23

Tags: scala apache-spark parquet spark-structured-streaming

I have applied an aggregation on a streaming DataFrame using complete output mode. To save the DataFrame locally, I implemented a foreach sink. I am able to save the DataFrame as text, but I need to save it in Parquet format.

import java.io.{File, FileWriter}
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{ForeachWriter, Row}

val writerForText = new ForeachWriter[Row] {
    var fileWriter: FileWriter = _

    // Called once per row: append the row as a comma-separated line
    override def process(value: Row): Unit = {
      fileWriter.append(value.toSeq.mkString(",") + "\n")
    }

    override def close(errorOrNull: Throwable): Unit = {
      fileWriter.close()
    }

    // Called once per partition: create the output directory and open the file
    override def open(partitionId: Long, version: Long): Boolean = {
      FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
      fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
      true
    }
  }

val columnName = "col1"
frame.select(count(columnName),count(columnName),min(columnName),mean(columnName),max(columnName),first(columnName), last(columnName), sum(columnName))
              .writeStream.outputMode(OutputMode.Complete()).foreach(writerForText).start()

How can I achieve this? Thanks in advance!

1 answer:

Answer 0 (score: -1)


> To save the DataFrame locally, I implemented a foreach sink. I am able to save the DataFrame as text, but I need to save it in Parquet format.

The default format when saving a streaming Dataset is... Parquet. That said, you don't have to use the fairly advanced foreach sink; you can simply use the parquet file sink.

The query could be as follows:

scala> :type in
org.apache.spark.sql.DataFrame

scala> in.isStreaming
res0: Boolean = true

in.writeStream.
  option("checkpointLocation", "/tmp/checkpoint-so").
  start("/tmp/parquets")  // parquet is the default format of the file sink
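
Note, however, that the file (parquet) sink only supports Append output mode, which is why this answer may not work for a Complete-mode aggregation as asked. On Spark 2.4 or later, one option is `foreachBatch`, which hands each micro-batch to the regular batch parquet writer. This is a minimal sketch, assuming Spark 2.4+, the streaming DataFrame `frame` and column `col1` from the question, and the hypothetical output path `/tmp/parquets`:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

frame.select(count("col1"), min("col1"), mean("col1"), max("col1"))
  .writeStream
  .outputMode(OutputMode.Complete())
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // In complete mode each micro-batch carries the full aggregate,
    // so overwrite the previous snapshot rather than appending to it.
    batch.write.mode(SaveMode.Overwrite).parquet("/tmp/parquets")
  }
  .start()
```

Unlike a hand-rolled `ForeachWriter`, this reuses Spark's batch Parquet writer, so partitioning and schema handling come for free.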