Apache Beam 2.9.0
I've built a pipeline that pulls data from BigQuery and runs it through a series of transforms. The options attach a start date using a ValueProvider:
ValueProvider<String> getStartTime();
void setStartTime(ValueProvider<String> startTime);
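For context, the full options interface looks roughly like this (a sketch; the interface name and the @Description annotation are illustrative, only the startTime accessors come from the real pipeline):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

// Hypothetical options interface; only getStartTime/setStartTime are from the actual pipeline.
public interface MyOptions extends PipelineOptions {
    @Description("Start date as an ISO-8601 instant, e.g. 2019-01-01T00:00:00Z")
    ValueProvider<String> getStartTime();

    void setStartTime(ValueProvider<String> startTime);
}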
I then pull the data with BigQueryIO (altered a bit to make it clearer what is happening):
BigQueryIO.read(
        (SerializableFunction<SchemaAndRecord, AggregatedRowRecord>)
            input -> new BigQueryParser().apply(input.getRecord()))
    .withoutValidation()
    .withTemplateCompatibility()
    .fromQuery(
        ValueProvider.NestedValueProvider.of(
            opts.getStartTime(),
            (SerializableFunction<String, String>)
                input -> {
                  Instant instant = Instant.parse(input);
                  return String.format(
                      <large SQL statement with a %s in it>,
                      String.format(
                          "%d_%d_%d",
                          instant.get(ChronoField.YEAR),
                          instant.get(ChronoField.MONTH_OF_YEAR),
                          instant.get(ChronoField.DAY_OF_MONTH)));
                }))
    .withCoder(<coder for AggregatedRowRecords>)
    .usingStandardSql()
This is then added to the pipeline as usual (p.apply(<above>)).
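For completeness, the wiring around that is roughly as follows (a sketch; the options interface and variable names are illustrative):

MyOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(opts);
p.apply(<the BigQueryIO.read(...) shown above>);
p.run().waitUntilFinish();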
Now I run it:
--project=<project> \
--tempLocation=<directory> \
--stagingLocation=<directory> \
--network=dataflow \
--subnetwork=<subnetwork> \
--defaultWorkerLogLevel=DEBUG \
--appName=<name> \
--runner=DirectRunner
This results in the following error:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=startTime, default=null}
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:332)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:197)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
at <class>.main(<class>.java:<>)
Caused by: java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=startTime, default=null}
at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.get(ValueProvider.java:228)
at org.apache.beam.sdk.options.ValueProvider$NestedValueProvider.get(ValueProvider.java:131)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.createBasicQueryConfig(BigQueryQuerySource.java:230)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.dryRunQueryIfNeeded(BigQueryQuerySource.java:175)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTableToExtract(BigQueryQuerySource.java:115)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:102)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$TypedRead$2.processElement(BigQueryIO.java:783)
The use of NestedValueProvider comes from this example on setting up templates:
The user provides a substring for a BigQuery query, such as a specific date. The transform uses the substring to create the full query. Calling .get() returns the full query.
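In other words, a minimal sketch of that pattern (identifiers and query text here are illustrative, not from my pipeline):

ValueProvider<String> query =
    ValueProvider.NestedValueProvider.of(
        options.getStartTime(),
        (SerializableFunction<String, String>)
            date -> "SELECT * FROM `my-project.my_dataset.my_table` WHERE dt = '" + date + "'");
// The runner calls query.get() at execution time; the function then builds and returns the full query.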
However, removing the value provider logic doesn't seem to help. Removing the ValueProvider from withQuery entirely works fine, but that makes it impossible to set the value via the options.
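(Concretely, the variant that works simply hard-codes the substituted value instead of going through the provider, roughly:

.fromQuery(
    String.format(
        <large SQL statement with a %s in it>,
        "2019_1_1"))

where "2019_1_1" is just an illustrative date suffix.)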
Answer 0: (score 0)
The exception is telling you the problem: Apache Beam first builds the pipeline and its classes, and only then starts running data through it. At that construction stage you cannot access the option values; they are just metadata used to build the pipeline.
The way to get around it is to create a ParDo function / PTransform that takes the options it needs as constructor parameters, so it can access them from within its own logic.
See this example (it's my own use case; I ran into the same problem in the last few days):
The pipeline:
HistoryProcessingOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
        .as(HistoryProcessingOptions.class);
Pipeline pipeline = Pipeline.create(options);

pipeline.apply(SourceRead.of(options.getSourceBigQueryTable().get(),
        options.getSourceBigQueryDataset().get(),
        options.getSourceBigQueryProject().get(),
        options.getFromDate().get(),
        options.getToDate().get()
));
The transform itself:
public class SourceRead extends PTransform<PBegin, PCollection<TableRow>> {

    private String sourceBigQueryTable;
    private String sourceBigQueryDataset;
    private String sourceBigQueryProject;
    private String formDate;
    private String toDate;
    private static Logger logger = LoggerFactory.getLogger(SourceRead.class);

    public SourceRead(String sourceBigQueryTable, String sourceBigQueryDataset, String sourceBigQueryProject, String formDate, String toDate) {
        this.sourceBigQueryTable = sourceBigQueryTable;
        this.sourceBigQueryDataset = sourceBigQueryDataset;
        this.sourceBigQueryProject = sourceBigQueryProject;
        this.formDate = formDate;
        this.toDate = toDate;
    }

    public static SourceRead of(String sourceBigQueryTable, String sourceBigQueryDataset, String sourceBigQueryProject, String yearToLoad, String dateToLoad) {
        return new SourceRead(sourceBigQueryTable, sourceBigQueryDataset, sourceBigQueryProject, yearToLoad, dateToLoad);
    }

    @Override
    public PCollection<TableRow> expand(PBegin input) {
        String query = "SELECT * FROM TABLE_DATE_RANGE([" + sourceBigQueryProject + ":" + sourceBigQueryDataset + "." + sourceBigQueryTable + "],"
                + "TIMESTAMP('" + formDate + "'),"
                + "TIMESTAMP('" + toDate + "'))";
        logger.info("query is " + query);

        return input.apply(BigQueryIO.readTableRows()
                .fromQuery(query));
    }
}