Time: 2019-02-04 10:19:42

Tags: scala apache-flink parquet

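I have a Parquet file on HDFS. It is overwritten with a new one every day. My goal is to emit the contents of this Parquet file continuously, every time it changes, in a Flink job using the DataStream API. The end goal is to use the file content in broadcast state, but that is out of scope for this question.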

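  1. To process files continuously, there is a very useful API, described under Data-sources. More specifically, FileProcessingMode.PROCESS_CONTINUOUSLY is exactly what I need. It works without problems for reading and monitoring text files, but not for Parquet files: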
// Partial version 1: the raw file is processed continuously
val path: String = "hdfs://hostname/path_to_file_dir/"
val textInputFormat: TextInputFormat = new TextInputFormat(new Path(path))
// monitor the file continuously every minute
val stream: DataStream[String] = streamExecutionEnvironment.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60000)
  2. To process Parquet files, I can use Hadoop input formats through this API: using-hadoop-inputformats. However, this API has no FileProcessingMode parameter, and it processes the file only once:
// Partial version 2: the parquet file is only processed once
val parquetPath: String = "/path_to_file_dir/parquet_0000"
// wrap the Hadoop Parquet input format so that Flink can read it
val hadoopInputFormat: HadoopInputFormat[Void, ArrayWritable] = HadoopInputs.readHadoopFile(new MapredParquetInputFormat(), classOf[Void], classOf[ArrayWritable], parquetPath)
val stream: DataStream[(Void, ArrayWritable)] = streamExecutionEnvironment.createInput(hadoopInputFormat).map { record =>
  // process the record here ...
  record
}

I would like to combine these two APIs somehow, so that I can process Parquet files continuously through the DataStream API. Has anyone tried something similar?

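import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.page.PageReadStore
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.ColumnIOFactory

class ParquetSourceFunction extends SourceFunction[Int] {
  // volatile because cancel() is called from a different thread than run()
  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
    while (isRunning) {
      val path = new Path("path_to_parquet_file")
      val conf = new Configuration()

      val readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
      val metadata = readFooter.getFileMetaData
      val schema = metadata.getSchema
      val parquetFileReader = new ParquetFileReader(conf, metadata, path, readFooter.getBlocks, schema.getColumns)
      var pages: PageReadStore = null
      try {
        while ({ pages = parquetFileReader.readNextRowGroup; pages != null }) {
          val rows = pages.getRowCount
          val columnIO = new ColumnIOFactory().getColumnIO(schema)
          val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
          (0L until rows).foreach { _ =>
            val group = recordReader.read()
            val my_integer = group.getInteger("field_name", 0)
            // emit the value to the downstream operators
            ctx.collect(my_integer)
          }
        }
      } finally {
        // release the file handle before the next pass
        parquetFileReader.close()
      }

      // do whatever logic suits you to stop "watching" the file
      // assumption: pause one minute between passes, matching the interval used in the question
      Thread.sleep(60000)
    }
  }

  override def cancel(): Unit = isRunning = false
}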


val dataStream: DataStream[Int] = streamExecutionEnvironment.addSource(new ParquetSourceFunction)
// do what you want with your new datastream
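
The comment about stopping the "watching" leaves open how the source decides that the file has actually changed. One possible sketch, assuming the file's modification time is a good enough signal that it was overwritten: keep the last timestamp seen and only re-read when it differs. `ModificationTracker` and `readModificationTime` are hypothetical names, not part of Flink or Parquet. The timestamp lookup is passed in as a function so the logic stays independent of the file system.

```scala
// Hypothetical helper (not a Flink or Parquet class): remembers the last
// modification time it saw and reports whether the file changed since then.
final class ModificationTracker(readModificationTime: () => Long) {
  // None until the first check has been made
  private var lastSeen: Option[Long] = None

  /** True on the first call, and afterwards whenever the timestamp differs from the previous one. */
  def hasChanged(): Boolean = {
    val current = readModificationTime()
    val changed = !lastSeen.contains(current)
    lastSeen = Some(current)
    changed
  }
}

// Possible use inside the source's run loop (assumption, adapt as needed):
//
//   val fs = path.getFileSystem(conf)   // Hadoop FileSystem for the path
//   val tracker = new ModificationTracker(() => fs.getFileStatus(path).getModificationTime)
//   while (isRunning) {
//     if (tracker.hasChanged()) { /* read the Parquet file and ctx.collect(...) each value */ }
//     Thread.sleep(60000)
//   }
```

With this check, a day on which the file is not overwritten emits nothing, instead of replaying the same rows on every pass. If the overwrite is not atomic, the source may still observe a partially written file, so this sketch does not remove the need for care on the writer side.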