Dataflow pipeline stuck initializing tempLocation?

Date: 2019-01-18 17:28:35

Tags: java-8 google-bigquery google-cloud-storage google-cloud-dataflow apache-beam

I am new to Dataflow and am trying to set up a streaming pipeline that reads CSV files from Google Cloud Storage into BigQuery. The pipeline is created successfully, and the CSV files are read and parsed. However, the pipeline as a whole never initializes correctly, and as a result no data is loaded into BigQuery.

I am using Java 8 and Apache Beam 2.5.0.

When I step through the Dataflow execution graph, I can see that the step Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/ParDo(StreamingPCollectionViewWriter) receives input but never emits any output. As a result, the following step, Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/CreateDataflowView, is never executed.

My code was "inspired" by https://github.com/asaharland/beam-pipeline-examples/blob/master/src/main/java/com/harland/example/streaming/StreamingFilePipeline.java:

package com.organization.processor.gcs.myproject;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyStreamPipeline {

    private static final Logger LOG = LoggerFactory.getLogger(MyStreamPipeline.class);

    private static final int WINDOW_SIZE_SECONDS = 120;

    public interface MyOptions extends PipelineOptions, GcpOptions {
        @Description("BigQuery Table Spec project_id:dataset_id.table_id")
        ValueProvider<String> getBigQueryTableSpec();
        void setBigQueryTableSpec(ValueProvider<String> value);

        @Description("Google Cloud Storage Bucket Name")
        ValueProvider<String> getBucketUrl();
        void setBucketUrl(ValueProvider<String> value);
    }


    public static void main(String[] args) throws IOException {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        Pipeline pipeline = Pipeline.create(options);

        List<TableFieldSchema> tableFields = new ArrayList<>();
        tableFields.add(new TableFieldSchema().setName("FIELD_NAME").setType("INTEGER"));
        // ... more fields here ...
        TableSchema schema = new TableSchema().setFields(tableFields);

        pipeline
        .apply("Read CSV as string from Google Cloud Storage", 
            TextIO
                .read()
                .from(options.getBucketUrl() + "/**")
                .watchForNewFiles(
                    // Check for new files every minute
                    Duration.standardMinutes(1),
                    // Never stop checking for new files
                    Watch.Growth.never())
                )
        .apply(String.format("Window Into %d Second Windows", WINDOW_SIZE_SECONDS),
            Window.into(FixedWindows.of(Duration.standardSeconds(WINDOW_SIZE_SECONDS))))
        .apply("Convert CSV string to Record", 
            ParDo.of(new CsvToRecordFn()))
        .apply("Record to TableRow",
            ParDo.of(new DoFn<Record, TableRow>() {
                @ProcessElement
                public void processElement(ProcessContext c)  {
                    Record record = c.element();
                    TableRow tr = record.getTableRow();
                    c.output(tr);
                    return;}}))
        .apply("Write to bigquery", 
            BigQueryIO
                .writeTableRows()
                .to(options.getBigQueryTableSpec())
                .withSchema(schema)
                .withTimePartitioning(new TimePartitioning().setField("PARTITION_FIELD_NAME").setType("DAY"))
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
    }
}
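
For reference, Record and CsvToRecordFn are small helper classes that are not shown above. A minimal sketch of what they might look like, assuming a single-column CSV that maps onto the FIELD_NAME column from the schema (the real classes have more fields and more parsing logic):

// Hypothetical sketch only; implements Serializable so Beam can
// fall back to a SerializableCoder for this element type.
public class Record implements Serializable {

    private final int fieldName;

    public Record(int fieldName) {
        this.fieldName = fieldName;
    }

    // Map this record onto a BigQuery TableRow matching the table schema.
    public TableRow getTableRow() {
        return new TableRow().set("FIELD_NAME", fieldName);
    }
}

// Hypothetical sketch only: parses one CSV line into a Record.
public class CsvToRecordFn extends DoFn<String, Record> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] columns = c.element().split(",");
        c.output(new Record(Integer.parseInt(columns[0].trim())));
    }
}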

I launch the pipeline with a Maven command like this:

mvn -f pom_2-5-0.xml clean compile exec:java \
      -Dexec.mainClass=com.organization.processor.gcs.myproject.MyStreamPipeline \
      -Dexec.args=" \
      --project=$PROJECT_ID \
      --stagingLocation=gs://$PROJECT_ID-processor/$VERSION/staging \
      --tempLocation=gs://$PROJECT_ID-processor/$VERSION/temp/ \
      --gcpTempLocation=gs://$PROJECT_ID-processor/$VERSION/gcptemp/ \
      --runner=DataflowRunner \
      --zone=$DF_ZONE \
      --region=$DF_REGION \
      --numWorkers=$DF_NUM_WORKERS \
      --maxNumWorkers=$DF_MAX_NUM_WORKERS \
      --diskSizeGb=$DF_DISK_SIZE_GB \
      --workerMachineType=$DF_WORKER_MACHINE_TYPE \
      --bucketUrl=$GCS_BUCKET_URL \
      --bigQueryTableSpec=$PROJECT_ID:$BQ_TABLE_SPEC \
      --streaming"

I really do not understand why the pipeline does not initialize correctly and why no data ends up in BigQuery.

Any help is appreciated!
