Question

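I have a CSV file whose first row contains the column headers. I read it and clean up those headers so they meet BigQuery's column-name requirements. However, I need to reference the schema before the pipeline starts. What is the best practice for letting BigQueryIO.Write respond to the headers in this way? Currently my code looks like this: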

    // create table
    Table table = new Table();
    // Where logically should the following line go?
    TableSchema customSchema = ?
    table.setSchema(customSchema);
    TableReference tableRef = new TableReference();
    tableRef.setDatasetId("foo_dataset");
    tableRef.setProjectId("bar_project");
    tableRef.setTableId("baz_table");
    table.setTableReference(tableRef);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.named("ReadCSV").from("gs://bucket/file.csv"))
        // Detect if it's header row
        .apply(ParDo.of(new ExtractHeader()))
        .apply(ParDo.of(new ToTableRow()))
        .apply(BigQueryIO.Write.named("Write")
            .to(tableRef)
            // Where logically should the following line go?
            .withSchema(customSchema));
    p.run();

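I am currently trying to implement this as two pipelines that look (roughly) like the following, but the execution order in Dataflow is not reliable, so I get errors when the BQ table does not exist.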

    PCollection readIn = p.apply(TextIO.Read.named("ReadCSV").from("gs://bucket/file.csv"))
        .apply(ParDo.of(new ExtractHeader()));

    TableSchema customSchema = /* generate schema based on what I now know the headers are */

    readIn.apply(ParDo.of(new ToTableRow()))
        .apply(BigQueryIO.Write.named("Write")
            .to(tableRef)
            // Where logically should the following line go?
            .withSchema(customSchema));
    p.run();

Answer 1

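This feature (dynamic schemas) is currently under review at https://github.com/apache/beam/pull/2609 (I am the one reviewing it). You can try out the in-progress PR, but be aware that its API may change somewhat as a result of the review. I will update this answer when the PR is submitted.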
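
In the meantime, one possible workaround (my own sketch, not something taken from the PR) is to read just the header line before the pipeline is constructed and derive the column names up front, so a schema is available when BigQueryIO.Write is configured. The class below is a hypothetical helper, not part of Beam or the BigQuery client library. It assumes the header contains no quoted commas, that column names may only contain letters, digits, and underscores and must start with a letter or underscore, and that names differing only in case count as duplicates. It does not handle length limits or reserved prefixes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Beam or the BigQuery client.
final class HeaderSanitizer {

    // Replaces every character outside [A-Za-z0-9_] with '_' and prefixes '_'
    // when the result is empty or starts with a digit.
    static String sanitize(String raw) {
        String name = raw.trim().replaceAll("[^A-Za-z0-9_]", "_");
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name;
    }

    // Splits a header line on commas (assumption: no quoted commas) and
    // appends a numeric suffix to names that collide after sanitizing.
    static List<String> columnNames(String headerLine) {
        List<String> names = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String cell : headerLine.split(",", -1)) {
            String base = sanitize(cell);
            String name = base;
            int suffix = 1;
            // Compare case-insensitively (assumption about duplicate detection).
            while (!seen.add(name.toLowerCase(Locale.ROOT))) {
                name = base + "_" + suffix++;
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [Order_ID, _2017_total, e_mail, e_mail_1]
        System.out.println(columnNames("Order ID,2017 total,e-mail,e mail"));
    }
}
```

Each resulting name could then be turned into a field of the `TableSchema` passed to `.withSchema(...)`, for example with every column typed as STRING. The same `sanitize` logic would need to be reused in `ToTableRow` so that row keys match the schema.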

How do I read and transform CSV headers before BigQueryIO.Write?
