Question

I have a small Spark (version 2.3) program in Scala that reads data from Kafka and writes it to HDFS as parquet files.

Here is the part that does the saving:

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._ // needed for 20.seconds

df.writeStream.
  partitionBy("dayInWeek").
  format("parquet").
  outputMode(OutputMode.Append).
  option("path", "parquet-output-dir").
  option("checkpointLocation", "checkpoint-dir").
  trigger(Trigger.ProcessingTime(20.seconds)).
  start().
  awaitTermination()

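As a result, I get a lot of small parquet files (about 6 KB each), because a save runs every 20 seconds.

Is it OK to have many small files in HDFS instead of a few large ones?

If not, how can I save the streaming data into larger files (without increasing Trigger.ProcessingTime)?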
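One possible approach, shown here only as a minimal sketch, is to leave the streaming job as it is and run a separate batch job from time to time that rewrites the small files into larger ones. The object `CompactParquet`, the function `compact`, and the idea of writing to a second directory are assumptions made for illustration; they are not part of the program above. The sketch also assumes the compacted copy is what downstream readers use.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object CompactParquet {
  // Hypothetical helper: reads every small parquet file under inputDir and
  // rewrites the data under outputDir, partitioned by dayInWeek.
  def compact(spark: SparkSession, inputDir: String, outputDir: String): Unit = {
    spark.read.parquet(inputDir)
      // Shuffle so that all rows of one day end up in the same task, which
      // typically yields about one output file per dayInWeek value.
      .repartition(col("dayInWeek"))
      .write
      .mode(SaveMode.Overwrite) // replaces any previous compacted copy
      .partitionBy("dayInWeek")
      .parquet(outputDir)
  }
}
```

It could be called as `CompactParquet.compact(sparkSession, "parquet-output-dir", "parquet-compacted-dir")`, where the second directory name is made up. Writing to a different directory avoids overwriting files the streaming query is still producing.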

Also:

The data is saved in HDFS like this:

 ...\some_path\dayInWeek=monday\parquet-files

How can I read the data for a specific dayInWeek from HDFS?

sparkSession.read.parquet("...\some_path") // this returns data from all days
//sparkSession.read.parquet("...\some_path\dayInWeek=monday") // this results in an error
sparkSession.read.parquet("...\some_path").filter($"dayInWeek" === "monday") // this will also load all days before filtering
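For reading a single partition directory, one option that may help is Spark's `basePath` read option, which tells partition discovery where the partitioned table starts so that the `dayInWeek` column is still derived from the directory name. The following is only a sketch: `ReadOneDay`, `partitionDir` and `readDay` are hypothetical names, and it assumes forward-slash HDFS paths. It has not been checked against the error mentioned above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadOneDay {
  // Builds the directory of one partition, e.g. <basePath>/dayInWeek=monday.
  def partitionDir(basePath: String, day: String): String =
    s"$basePath/dayInWeek=$day"

  // Reads only the files under one dayInWeek directory. The basePath option
  // keeps dayInWeek as a column in the resulting DataFrame.
  def readDay(spark: SparkSession, basePath: String, day: String): DataFrame =
    spark.read
      .option("basePath", basePath)
      .parquet(partitionDir(basePath, day))
}
```

A call would look like `ReadOneDay.readDay(sparkSession, basePath, "monday")`, with `basePath` set to the directory that contains the `dayInWeek=...` subdirectories.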

How to save parquet files to HDFS in an ideal way using Structured Streaming, and then read them back

0 answers: