I run into a problem when running a Beam Dataflow pipeline (written in Java) that loads data into Google BigQuery. It happens in both my development and production environments.

I process multiple data items through the Beam pipeline on separate threads. A ParDo, applied to a PCollection, processes and transforms the data as it is received. I then try to write the transformed data to Google BigQuery.

I am trying to write the transformed data to BigQuery with a routine like this:

    transformedData
        .apply("Load fact data",
            BigQueryIO.<ValidatedDataRecord>write()
                .to(new LoadDataFact.DynamicFactTableDestination(dataType.label))
                .withFormatFunction(new LoadDataFact.FactSerializationFn())
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

The important point is that I am using the following:

    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)

...because I want a new table to be created if it does not already exist.

The problem is that when new data arrives in the data stream and that data needs a new dataset, I often (but not always) get the following 404 error:

    Exception thrown in class : com.myOrg.myPackage.myClass Error : java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
    {
      "code" : 404,
      "errors" : [ {
        "domain" : "global",
        "message" : "Not found: Table my-project:my-dataset.my-table",
        "reason" : "notFound"
      } ],
      "message" : "Not found: Table my-project:my-dataset.my-table",
      "status" : "NOT_FOUND"
    }
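
I looked into what happens in the underlying Beam SDK library. When the first new data item is processed, CreateTables.class should try to create the new table and, if that succeeds, add the table to the static createdTables collection to record that it has been created. The 404 error seems to occur after the table has been created (although sometimes the table is not created at all). I am not sure what causes this behaviour or what triggers the 404 (the error message is not very informative). In BigQuery, this kind of error usually means that the table did not exist at some stage when it was accessed.

CreateTables.class is part of the following Beam SDK library: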
C:\Users\my.username\.m2\repository\org\apache\beam\beam-sdks-java-io-google-cloud-platform\2.5.0\beam-sdks-java-io-google-cloud-platform-2.5.0.jar!\org\apache\beam\sdk\io\gcp\bigquery\CreateTables.class
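
To make the "create once, then remember in a static collection" behaviour described above easier to reason about, here is a minimal sketch of that pattern in plain Java. It is a simplified, hypothetical model and not Beam's actual source: `CreatedTablesSketch`, `TableStore`, and `ensureCreated` are names invented for illustration, and `TableStore` is an in-memory stand-in for the BigQuery service. It shows one logical consequence of such a cache: once a table spec is cached, the store is not consulted again, so a table that goes missing afterwards is not recreated. Whether this is what happens in the pipeline above is not established.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of a "create once, remember in a static set" pattern.
// It is NOT Beam's source code; TableStore stands in for the BigQuery service.
final class CreatedTablesSketch {

    /** In-memory stand-in for the remote table catalogue (an assumption of this sketch). */
    static final class TableStore {
        private final Set<String> tables = ConcurrentHashMap.newKeySet();

        boolean exists(String spec) { return tables.contains(spec); }
        void create(String spec) { tables.add(spec); }
        void delete(String spec) { tables.remove(spec); }
    }

    // Process-wide cache of table specs that are believed to exist.
    private static final Set<String> CREATED_TABLES = ConcurrentHashMap.newKeySet();

    private CreatedTablesSketch() {}

    /**
     * Creates the table unless the cache already lists it.
     * Returns false on a cache hit, true otherwise.
     */
    static boolean ensureCreated(TableStore store, String spec) {
        if (CREATED_TABLES.contains(spec)) {
            return false; // cache hit: the store is not consulted again
        }
        synchronized (CREATED_TABLES) {
            // Re-check inside the lock so only one thread attempts the create.
            if (!CREATED_TABLES.contains(spec)) {
                if (!store.exists(spec)) {
                    store.create(spec);
                }
                CREATED_TABLES.add(spec);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TableStore store = new TableStore();
        String spec = "my-project:my-dataset.my-table";

        ensureCreated(store, spec);
        System.out.println("exists after first call: " + store.exists(spec));

        store.delete(spec);          // the table disappears without the cache knowing
        ensureCreated(store, spec);  // cache hit, so nothing is recreated
        System.out.println("exists after second call: " + store.exists(spec));
    }
}
```

In this model the second call leaves the table missing, so a subsequent write would see "not found" even though the cache reports the table as created.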
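
I have raised a bug in the Beam Jira: https://issues.apache.org/jira/browse/BEAM-7195

I tried updating the Beam SDKs Java Core library to v2.12.0, but for some reason the data flow then stopped working completely.

Is there a way to work around this problem? I am using a custom class that extends the DynamicDestinations class that Beam uses for processing (this is how the documentation says dynamic destinations should be implemented; see https://beam.apache.org/documentation/io/built-in/google-bigquery/#using-dynamic-destinations).

My custom class is shown below. The getTable method is called while each data item is processed, to determine which table the data should be added to: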
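
    import com.google.api.services.bigquery.model.TableReference;
    import com.google.api.services.bigquery.model.TimePartitioning;
    import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class LoadDataFact {
        public static class DynamicFactTableDestination extends DynamicDestinations<ValidatedDataRecord, String> {
            private static final long serialVersionUID = -1234561111111123456L;
            private static final String projectID = "my-project";
            private String dataType = "none";
            private String elementDuration = "unknown";

            public DynamicFactTableDestination(String dataType) {
                this.dataType = dataType;
            }

            // The destination key is the dataset name carried by each record.
            @Override
            public String getDestination(ValueInSingleWindow<ValidatedDataRecord> element) {
                return element.getValue().DatasetName;
            }

            // Maps a dataset name to the day-partitioned table data_for_<dataType> in that dataset.
            @Override
            public TableDestination getTable(String destination) {
                try {
                    return new TableDestination(new TableReference()
                            .setProjectId(projectID)
                            .setDatasetId(destination)
                            .setTableId(String.format("data_for_%s", this.dataType)), "Data staging table",
                            new TimePartitioning()
                                    .setType("DAY")
                                    .setField("created_date_time"));
                }
                catch (Exception ex) {
                    System.out.println("Error " + ex.getMessage());
                }
                // Reached only if building the TableDestination threw an exception.
                return null;
            }

            // getSchema(String destination), which DynamicDestinations also requires, is not shown in this post.
        }

        // FactSerializationFn, used in the write routine, is not shown in this post.
    }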
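
To compare the table named in the 404 message with the table the pipeline is expected to target, it can help to spell out the name that the getTable method above resolves to. The following is a minimal plain-Java sketch of that naming logic only. `FactTableNames` and `tableSpec` are hypothetical names introduced here, and the "sales" data type in the usage line is an invented example value; the sketch does not call Beam or BigQuery.

```java
import java.util.Objects;

// Hypothetical helper, not part of Beam: mirrors only the naming logic of getTable.
final class FactTableNames {
    // Placeholder project id, as used in the post.
    private static final String PROJECT_ID = "my-project";

    private FactTableNames() {}

    /** Returns the "project:dataset.table" string for a dataset name and data type. */
    static String tableSpec(String datasetName, String dataType) {
        Objects.requireNonNull(datasetName, "datasetName");
        Objects.requireNonNull(dataType, "dataType");
        return String.format("%s:%s.data_for_%s", PROJECT_ID, datasetName, dataType);
    }

    public static void main(String[] args) {
        // "sales" is an invented data type used only for this example.
        System.out.println(tableSpec("my-dataset", "sales"));
        // prints "my-project:my-dataset.data_for_sales"
    }
}
```

Logging this string for each new destination would show whether the table in the error message matches the one the custom class asked for.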

Here is my pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.myOrg.myNamespace</groupId>
<artifactId>my-artifact-name</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<beam.version>2.5.0</beam.version>
<maven-compiler-plugin.version>3.6.2</maven-compiler-plugin.version>
<maven-exec-plugin.version>1.6.0</maven-exec-plugin.version>
<slf4j.version>1.7.25</slf4j.version>
</properties>
<repositories>
<repository>
<id>apache.snapshots</id>
<name>Apache Development Snapshot Repository</name>
<url>https://repository.apache.org/content/repositories/snapshots/</url>
<releases>
<enabled>false</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>${maven-exec-plugin.version}</version>
<configuration>
<cleanupDaemonThreads>false</cleanupDaemonThreads>
</configuration>
</plugin>
<!--This plugin's configuration is used to store Eclipse m2e settings only. It has no influence on the Maven build itself.-->
<plugin>
<groupId>org.eclipse.m2e</groupId>
<artifactId>lifecycle-mapping</artifactId>
<version>1.0.0</version>
<configuration>
<lifecycleMappingMetadata>
<pluginExecutions>
<pluginExecution>
<pluginExecutionFilter>
<groupId>
org.apache.maven.plugins
</groupId>
<artifactId>
maven-compiler-plugin
</artifactId>
<versionRange>
[@maven-compiler-plugin.version@,)
</versionRange>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</pluginExecutionFilter>
<action>
<ignore></ignore>
</action>
</pluginExecution>
</pluginExecutions>
</lifecycleMappingMetadata>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-core</artifactId>
<version>${beam.version}</version>
</dependency>
<dependency>
<groupId>com.google.cloud.dataflow</groupId>
<artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
<version>${beam.version}</version>
</dependency>
<!-- slf4j API frontend binding with JUL backend -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-jdk14</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.msgpack</groupId>
<artifactId>msgpack-core</artifactId>
<version>0.8.16</version>
</dependency>
</dependencies>
</project>