How do I make Google Dataflow write to a BigQuery table name taken from the input data?

Asked: 2019-06-11 14:03:20

Tags: google-bigquery google-cloud-dataflow apache-beam

I'm new to Dataflow/Beam. I'm trying to write some data to BigQuery. The destination table name arrives from the previous stage as a map entry under the key "table", but I don't know how to pass that table name through the pipeline into BigQuery. This is where I'm stuck. Any ideas on what to do next?

pipeline
// ...
//////// I guess I shouldn't output TableRow here?
.apply("ToBQRow", ParDo.of(new DoFn<Map<String, String>, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        ////////// WHAT DO I DO WITH "table"?
        String table = c.element().get("table");
        TableRow row = new TableRow();
        // ... set some records
        c.output(row);
    }
}))
.apply(BigQueryIO.writeTableRows().to(/* ///// WHAT DO I WRITE HERE?? */)
    .withSchema(schema)
    .withWriteDisposition(
        BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

1 Answer:

Answer 0 (score: 1)

You can use DynamicDestinations.

As an example, I create some dummy data and use the last word of each line as the key:

p.apply("Create Data", Create.of("this should go to table one",
                                 "I would like to go to table one",
                                 "please, table one",
                                 "I prefer table two",
                                 "Back to one",
                                 "My fave is one",
                                 "Rooting for two"))
.apply("Create Keys", ParDo.of(new DoFn<String, KV<String, String>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Use the last whitespace-separated word as the key
        String[] splitBySpaces = c.element().split(" ");
        c.output(KV.of(splitBySpaces[splitBySpaces.length - 1], c.element()));
    }
}))
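As a side note, the key-extraction logic in that DoFn can be exercised outside Beam as plain Java; this is a minimal sketch with a hypothetical tableKey helper (not part of the answer's pipeline) that mirrors the split-on-spaces logic:

```java
public class KeyExtraction {
    // Mirrors the DoFn above: the last whitespace-separated word
    // of each input line becomes the destination-table key.
    static String tableKey(String line) {
        String[] words = line.split(" ");
        return words[words.length - 1];
    }

    public static void main(String[] args) {
        // Lines ending in "one" and "two" route to different tables
        System.out.println(tableKey("this should go to table one")); // one
        System.out.println(tableKey("Rooting for two"));             // two
    }
}
```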

Then we use getDestination to control how each element is routed to a different table based on its key, and getTable to build the fully qualified table name (prepending the prefix). getSchema can be used if the different tables have different schemas. Finally, we control what gets written into each table with withFormatFunction:

.apply(BigQueryIO.<KV<String, String>>write()
    .to(new DynamicDestinations<KV<String, String>, String>() {
        public String getDestination(ValueInSingleWindow<KV<String, String>> element) {
            return element.getValue().getKey();
        }
        public TableDestination getTable(String name) {
            // Prepend the table prefix passed in via --output
            String tableSpec = output + name;
            return new TableDestination(tableSpec, "Table for type " + name);
        }
        public TableSchema getSchema(String name) {
            List<TableFieldSchema> fields = new ArrayList<>();
            fields.add(new TableFieldSchema().setName("Text").setType("STRING"));
            TableSchema ts = new TableSchema();
            ts.setFields(fields);
            return ts;
        }
    })
    .withFormatFunction(new SerializableFunction<KV<String, String>, TableRow>() {
        public TableRow apply(KV<String, String> row) {
            TableRow tr = new TableRow();
            tr.set("Text", row.getValue());
            return tr;
        }
    })
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

For a full test, I created the following tables:

bq mk dynamic_key
bq mk -f dynamic_key.dynamic_one Text:STRING
bq mk -f dynamic_key.dynamic_two Text:STRING

Then, after setting the $PROJECT, $BUCKET and $TABLE_PREFIX variables ($TABLE_PREFIX being PROJECT_ID:dynamic_key.dynamic_ in my case), I ran the job with the following command:

mvn -Pdataflow-runner compile -e exec:java \
    -Dexec.mainClass=com.dataflow.samples.DynamicTableFromKey \
    -Dexec.args="--project=$PROJECT \
        --stagingLocation=gs://$BUCKET/staging/ \
        --tempLocation=gs://$BUCKET/temp/ \
        --output=$TABLE_PREFIX \
        --runner=DataflowRunner"

We can verify that each element went to the correct table:

$ bq query "SELECT * FROM dynamic_key.dynamic_one"
+---------------------------------+
|              Text               |
+---------------------------------+
| please, table one               |
| Back to one                     |
| My fave is one                  |
| this should go to table one     |
| I would like to go to table one |
+---------------------------------+
$ bq query "SELECT * FROM dynamic_key.dynamic_two"
+--------------------+
|        Text        |
+--------------------+
| I prefer table two |
| Rooting for two    |
+--------------------+

Full code here.