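I have a CSV file whose first row contains the headers. I read the file and clean those headers so that they meet BigQuery's column-name requirements. However, I need to reference the schema before the pipeline starts. What is the best practice for letting BigQueryIO.Write respond to the headers in this way? Currently my code looks something like this: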
//create table
Table table = new Table();
// Where logically should the following line go?
TableSchema customSchema = ?
table.setSchema(customSchema);
TableReference tableRef = new TableReference();
tableRef.setDatasetId("foo_dataset");
tableRef.setProjectId("bar_project");
tableRef.setTableId("baz_table");
table.setTableReference(tableRef);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadCSV").from("gs://bucket/file.csv"))
// Detect if it's header row
.apply(ParDo.of(new ExtractHeader()))
.apply(ParDo.of(new ToTableRow()))
.apply(BigQueryIO.Write.named("Write")
.to(tableRef)
// Where logically should the following line go?
.withSchema(customSchema));
p.run();
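I am currently trying to implement two pipelines that look (roughly) like the following, but the order of execution in Dataflow is not reliable, so I run into errors when the BQ table does not exist yet.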
PCollection readIn = p.apply(TextIO.Read.named("ReadCSV").from("gs://bucket/file.csv"))
.apply(ParDo.of(new ExtractHeader()));
TableSchema customSchema = /* generate schema based on what I now know the headers are */;
readIn.apply(ParDo.of(new ToTableRow()))
.apply(BigQueryIO.Write.named("Write")
.to(tableRef)
// Where logically should the following line go?
.withSchema(customSchema));
p.run();
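Answer 0 (score: 2)
This feature (dynamic schemas) is currently under review at https://github.com/apache/beam/pull/2609 (I am reviewing it). You can try the in-progress PR, but note that its API may change somewhat as a result of the review. I will update this answer when the PR is submitted.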
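Until that lands, one possible stopgap is to read only the first line of the CSV before constructing the pipeline, so that the column names are already known when `withSchema` is called. The sketch below covers just the header-cleaning step in plain Java. `CsvHeaders`, `sanitize`, and `columnNames` are hypothetical names, not part of the Dataflow SDK. It assumes a simple comma-separated header with no quoted fields, and it assumes that column names may contain only letters, digits, and underscores, may not start with a digit, and must be unique regardless of case.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of any SDK): cleans CSV headers into column names.
final class CsvHeaders {
    private CsvHeaders() {}

    /** Turns one raw CSV header into a name made only of letters, digits, and underscores. */
    static String sanitize(String raw) {
        // Replace every run of disallowed characters with a single underscore.
        String name = raw.trim().replaceAll("[^A-Za-z0-9_]+", "_");
        // Assumption: a column name may not be empty or start with a digit.
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name;
    }

    /** Splits the header line on commas and returns unique, sanitized column names. */
    static List<String> columnNames(String headerLine) {
        List<String> names = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        // Assumption: no quoted fields, so a plain split on ',' is enough.
        for (String raw : headerLine.split(",", -1)) {
            String base = sanitize(raw);
            String name = base;
            // Append a numeric suffix when two headers collapse to the same name
            // (compared case-insensitively).
            for (int i = 2; !seen.add(name.toLowerCase(Locale.ROOT)); i++) {
                name = base + "_" + i;
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Example header line; the values are made up for illustration.
        System.out.println(columnNames("First Name,2017 total,e-mail,First Name"));
    }
}
```

Each resulting name could then be wrapped in a `TableFieldSchema` (for example with type `STRING`) and the list set on the `TableSchema` that is passed to `withSchema`, all before `p.run()` is called.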