Loading data into a BigQuery partitioned table using a Dataflow job

Time: 2017-07-25 12:10:24

Tags: google-cloud-dataflow dataflow apache-beam

I want to read a file and, based on the date value in one of its fields, write the rows to a BigQuery partitioned table. For example, if the file contains rows for two dates, July 25 and July 26, Dataflow should write the data into the two corresponding partitions.

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

public class StarterPipeline {
  private static final Logger LOG =
      LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject(""); // project ID omitted
    options.setTempLocation("gs://stage_location/");
    Pipeline p = Pipeline.create(options);

    // Schema of the destination table.
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("id").setType("STRING"));
    fields.add(new TableFieldSchema().setName("name").setType("STRING"));
    fields.add(new TableFieldSchema().setName("designation").setType("STRING"));
    fields.add(new TableFieldSchema().setName("joindate").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    PCollection<String> read =
        p.apply("Read Lines", TextIO.read().from("gs://hadoop_source_files/employee.txt"));

    // Parse each comma-separated line into a TableRow.
    PCollection<TableRow> rows = read.apply(ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] data = c.element().split(",");
        c.output(new TableRow()
            .set("id", data[0])
            .set("name", data[1])
            .set("designation", data[2])
            .set("joindate", data[3]));
      }
    }));


    // Route each row to a partition of the target table based on its joindate.
    rows.apply(BigQueryIO.writeTableRows().to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
      // Builds the table spec, appending the date as a partition decorator.
      public String getDate(String value) {
        return "project:dataset.DataFlow_Test$" + value;
      }

      @Override
      public TableDestination apply(ValueInSingleWindow<TableRow> value) {
        TableRow row = value.getValue();
        String tableSpec = getDate(row.get("joindate").toString());
        String tableDescription = "";
        return new TableDestination(tableSpec, tableDescription);
      }
    }).withFormatFunction(new SerializableFunction<TableRow, TableRow>() {
      @Override
      public TableRow apply(TableRow input) {
        // Elements are already TableRows, so pass them through unchanged.
        return input;
      }
    }).withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}

When I run the program I get the following error:

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Table reference is not in [project_id]:[dataset_id].[table_id] format: ... Caused by: java.lang.IllegalArgumentException: Table reference is not in [project_id]:[dataset_id].[table_id] format.

Please let me know if you have any suggestions.
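For reference, a table spec that BigQueryIO can parse names the project, dataset, and table, with an optional partition decorator appended after "$"; the decorator is the partition date as bare digits (yyyyMMdd). A minimal illustration, where "my-project" and "my_dataset" are placeholder names:

    // The decorator after "$" selects the 2017-07-25 partition of the table;
    // dashes or other separators inside the decorator are not expected here.
    String tableSpec = "my-project:my_dataset.DataFlow_Test$20170725";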

2 Answers:

Answer 0 (score: 1)

Beam does not currently support date-partitioned tables. See BEAM-2390, the issue tracking this feature.

Answer 1 (score: 0)

I was able to load data into partitioned tables based on the date in the data.
