I am new to Dataflow and am trying to set up a streaming pipeline that reads CSV files from Google Cloud Storage into BigQuery. The pipeline is created successfully and the CSV files are being read and parsed, but the pipeline as a whole never initializes correctly, so no data is loaded into BigQuery.
I am using Java 8 and Apache Beam 2.5.0.
When I walk through the Dataflow execution graph, I can see that the step
Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/ParDo(StreamingPCollectionViewWriter)
receives input but never emits any output. As a result, the following step
Write to bigquery/BatchLoads/TempFilePrefixView/Combine.GloballyAsSingletonView/View.CreatePCollectionView/CreateDataflowView
is never executed.
For reference, my code is "inspired" by this example: https://github.com/asaharland/beam-pipeline-examples/blob/master/src/main/java/com/harland/example/streaming/StreamingFilePipeline.java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyStreamPipeline {

    private static final Logger LOG = LoggerFactory.getLogger(MyStreamPipeline.class);
    private static final int WINDOW_SIZE_SECONDS = 120;

    public interface MyOptions extends PipelineOptions, GcpOptions {

        @Description("BigQuery Table Spec project_id:dataset_id.table_id")
        ValueProvider<String> getBigQueryTableSpec();
        void setBigQueryTableSpec(ValueProvider<String> value);

        @Description("Google Cloud Storage Bucket Name")
        ValueProvider<String> getBucketUrl();
        void setBucketUrl(ValueProvider<String> value);
    }

    public static void main(String[] args) throws IOException {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        Pipeline pipeline = Pipeline.create(options);

        List<TableFieldSchema> tableFields = new ArrayList<>();
        tableFields.add(new TableFieldSchema().setName("FIELD_NAME").setType("INTEGER"));
        // ... more fields here ...
        TableSchema schema = new TableSchema().setFields(tableFields);

        pipeline
            .apply("Read CSV as string from Google Cloud Storage",
                TextIO
                    .read()
                    .from(options.getBucketUrl() + "/**")
                    .watchForNewFiles(
                        // Check for new files every 1 minute(s)
                        Duration.standardMinutes(1),
                        // Never stop checking for new files
                        Watch.Growth.never()))
            .apply(String.format("Window Into %d Second Windows", WINDOW_SIZE_SECONDS),
                Window.into(FixedWindows.of(Duration.standardSeconds(WINDOW_SIZE_SECONDS))))
            // CsvToRecordFn is my own DoFn<String, Record> that parses one CSV line into a Record
            .apply("Convert CSV string to Record",
                ParDo.of(new CsvToRecordFn()))
            .apply("Record to TableRow",
                ParDo.of(new DoFn<Record, TableRow>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        Record record = c.element();
                        TableRow tr = record.getTableRow();
                        c.output(tr);
                    }
                }))
            .apply("Write to bigquery",
                BigQueryIO
                    .writeTableRows()
                    .to(options.getBigQueryTableSpec())
                    .withSchema(schema)
                    .withTimePartitioning(new TimePartitioning().setField("PARTITION_FIELD_NAME").setType("DAY"))
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
    }
}
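CsvToRecordFn and Record are my own classes and are omitted above for brevity. Roughly, they boil down to something like the following simplified sketch (assuming a plain comma split; the real parsing, validation, and field mapping are more involved, and the imports are the same as above plus java.io.Serializable):

// Simplified sketch only - the real CsvToRecordFn/Record do more parsing and validation.
public class CsvToRecordFn extends DoFn<String, Record> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Split one CSV line into its columns and wrap them in a Record.
        String[] columns = c.element().split(",");
        c.output(new Record(columns));
    }
}

public class Record implements Serializable {
    private final String[] columns;

    public Record(String[] columns) {
        this.columns = columns;
    }

    // Maps the parsed columns onto the BigQuery schema defined in main().
    public TableRow getTableRow() {
        TableRow row = new TableRow();
        row.set("FIELD_NAME", Long.parseLong(columns[0]));
        // ... more fields here ...
        return row;
    }
}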
I run the pipeline with a Maven command like this:
mvn -f pom_2-5-0.xml clean compile exec:java \
-Dexec.mainClass=com.organization.processor.gcs.myproject.MyStreamPipeline \
-Dexec.args=" \
--project=$PROJECT_ID \
--stagingLocation=gs://$PROJECT_ID-processor/$VERSION/staging \
--tempLocation=gs://$PROJECT_ID-processor/$VERSION/temp/ \
--gcpTempLocation=gs://$PROJECT_ID-processor/$VERSION/gcptemp/ \
--runner=DataflowRunner \
--zone=$DF_ZONE \
--region=$DF_REGION \
--numWorkers=$DF_NUM_WORKERS \
--maxNumWorkers=$DF_MAX_NUM_WORKERS \
--diskSizeGb=$DF_DISK_SIZE_GB \
--workerMachineType=$DF_WORKER_MACHINE_TYPE \
--bucketUrl=$GCS_BUCKET_URL \
--bigQueryTableSpec=$PROJECT_ID:$BQ_TABLE_SPEC \
--streaming"
I really do not understand why the pipeline does not initialize correctly and why no data is loaded into BigQuery.
Any help is appreciated!