Apache Beam for Google Cloud DataFlow - using BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED

Time: 2019-05-01 06:08:06

Tags: java google-bigquery apache-beam

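I am running into a problem when I run my Beam DataFlow (written in Java) to load data into Google BigQuery. The problem occurs in both my development and production environments.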

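I am processing multiple data items through my Beam pipeline on separate threads. Processing is done with a ParDo, which is applied to a PCollection to transform the received data. I then try to write the transformed data to Google BigQuery.

I am trying to write the transformed data to BigQuery using a routine like the one shown below:

transformedData
    .apply("Load fact data", BigQueryIO.<ValidatedDataRecord>write()
        .to(new LoadDataFact.DynamicFactTableDestination(dataType.label))
        .withFormatFunction(new LoadDataFact.FactSerializationFn())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

The important thing here is that I am using the following:

.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)

...because I expect a new table to be created where one does not already exist.

The problem is that when new data arrives as part of the data flow, and that data requires a new dataset, I often (but not always) get the following 404 error:

Exception thrown in class : com.myOrg.myPackage.myClass Error : java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table my-project:my-dataset.my-table",
"reason" : "notFound"
} ],
"message" : "Not found: Table my-project:my-dataset.my-table",
"status" : "NOT_FOUND"
}

I have looked into what happens in the underlying Beam SDKs Java Core library. When the first new data item is processed, CreateTables.class in the "Beam SDKs Java Core" library should try to create a new table and, if that succeeds, add the new table to the static createdTables collection to indicate that the table has been created. The 404 error seems to occur after the table has been created (although sometimes the table is not created at all). I am not sure what causes this behaviour or what triggers the 404 error (the error message does not give much information). Errors like this in BigQuery usually mean that the table did not exist at some stage while it was being accessed.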
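To make the mechanism described above concrete, here is a minimal plain-Java sketch of a "create if needed" step guarded by a static, process-wide cache. It is a simplified model, not Beam's actual code: the class CreateIfNeededSketch, the existingTables set (a stand-in for the BigQuery service) and the write method are all assumptions made for illustration. It only shows one way such a cache could become stale and lead to a "Not found" error on a later write; whether that is what happens in my pipeline is not established.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of a create-if-needed step with a static cache. Not Beam code. */
final class CreateIfNeededSketch {

    /** Hypothetical stand-in for the BigQuery service: the tables that really exist. */
    static final Set<String> existingTables =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    /** Process-wide cache of table specs this JVM believes it has already created. */
    static final Set<String> createdTables =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    /** Creates the table on first use and records it in the cache. */
    static void createIfNeeded(String tableSpec) {
        if (createdTables.contains(tableSpec)) {
            return; // Cache hit: no existence check is made against the service.
        }
        synchronized (createdTables) {
            if (!createdTables.contains(tableSpec)) {
                existingTables.add(tableSpec); // "create" the table
                createdTables.add(tableSpec);  // remember that it was created
            }
        }
    }

    /** Writes one row; fails like a 404 if the table is not actually there. */
    static void write(String tableSpec, String row) {
        createIfNeeded(tableSpec);
        if (!existingTables.contains(tableSpec)) {
            throw new IllegalStateException("Not found: Table " + tableSpec);
        }
    }

    public static void main(String[] args) {
        String spec = "my-project:my-dataset.my-table";
        write(spec, "row 1");        // first write creates the table
        existingTables.remove(spec); // table disappears without the cache knowing
        try {
            write(spec, "row 2");    // cache says "created", so nothing is re-created
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second write fails because the cache is consulted instead of the service, so anything that makes the two disagree (for example a table removed elsewhere) surfaces only at write time.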

CreateTables.class is located in the following Beam SDK library jar:

C:\Users\my.username\.m2\repository\org\apache\beam\beam-sdks-java-io-google-cloud-platform\2.5.0\beam-sdks-java-io-google-cloud-platform-2.5.0.jar!\org\apache\beam\sdk\io\gcp\bigquery\CreateTables.class

I have raised a bug on the Beam Jira: https://issues.apache.org/jira/browse/BEAM-7195

I tried updating the Beam SDKs Java Core library to v2.12.0, but for some reason the data flow then stopped working altogether.

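Is there a way to work around this? I am using a custom class that extends the DynamicDestinations class Beam uses for processing (this is how the documentation says dynamic destinations should be implemented; see https://beam.apache.org/documentation/io/built-in/google-bigquery/#using-dynamic-destinations).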

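My custom class is shown below. The getTable method is called while each data item is processed, to determine which table the data should be added to: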

public class LoadDataFact {

    public static class DynamicFactTableDestination extends DynamicDestinations<ValidatedDataRecord, String> {
        private static final long serialVersionUID = -1234561111111123456L;
        private static final String projectID = "my-project";
        private String dataType = "none";
        private String elementDuration = "unknown";

        public DynamicFactTableDestination(String dataType) {
            this.dataType = dataType;
        }

        @Override
        public String getDestination(ValueInSingleWindow<ValidatedDataRecord> element) {    

            return element.getValue().DatasetName;
        }

        @Override
        public TableDestination getTable(String destination) {

            try {
                return new TableDestination(new TableReference()
                        .setProjectId(projectID)
                        .setDatasetId(destination)
                        .setTableId(String.format("data_for_%s",this.dataType)), "Data staging table",
                        new TimePartitioning()
                                .setType("DAY")
                                .setField("created_date_time"));
            }
            catch (Exception ex) {
                System.out.println("Error " + ex.getMessage());
            }

            return null;
        }

        // The rest of the class, including getSchema, is cut off in the post as captured.
    }
}
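One detail of getTable above is that it catches any exception, prints it and returns null, so a bad destination would only show up later and far from its cause. As a hedged illustration of the alternative, the plain-Java sketch below builds the same "project:dataset.table" spec seen in the error message and fails fast on bad input. TableSpecSketch and tableSpecFor are hypothetical names and not part of the Beam API; the sample dataset and data type values are made up.

```java
/** Plain-Java illustration only; this class and its method are hypothetical, not Beam API. */
final class TableSpecSketch {

    /** Builds "project:dataset.table" for a destination key and data type, or throws. */
    static String tableSpecFor(String projectId, String datasetId, String dataType) {
        if (datasetId == null || datasetId.isEmpty()) {
            throw new IllegalArgumentException("datasetId must not be empty");
        }
        if (dataType == null || dataType.isEmpty()) {
            throw new IllegalArgumentException("dataType must not be empty");
        }
        // Same naming scheme as getTable: data_for_<dataType>.
        String tableId = String.format("data_for_%s", dataType);
        return projectId + ":" + datasetId + "." + tableId;
    }

    public static void main(String[] args) {
        // Prints my-project:sales.data_for_orders
        System.out.println(tableSpecFor("my-project", "sales", "orders"));
    }
}
```

Throwing here, instead of returning null, would make a malformed destination visible at the point where the table is chosen.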

Here is my pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.myOrg.myNamespace</groupId>
  <artifactId>my-artifact-name</artifactId>
  <version>1.0-SNAPSHOT</version>

  <packaging>jar</packaging>

  <properties>
    <beam.version>2.5.0</beam.version>
    <maven-compiler-plugin.version>3.6.2</maven-compiler-plugin.version>
    <maven-exec-plugin.version>1.6.0</maven-exec-plugin.version>
    <slf4j.version>1.7.25</slf4j.version>
  </properties>

  <repositories>
    <repository>
      <id>apache.snapshots</id>
      <name>Apache Development Snapshot Repository</name>
      <url>https://repository.apache.org/content/repositories/snapshots/</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

  <build>
   <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>${maven-compiler-plugin.version}</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>

    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>exec-maven-plugin</artifactId>
          <version>${maven-exec-plugin.version}</version>
          <configuration>
            <cleanupDaemonThreads>false</cleanupDaemonThreads>
          </configuration>
        </plugin>
        <!--This plugin's configuration is used to store Eclipse m2e settings only. It has no influence on the Maven build itself.-->
        <plugin>
           <groupId>org.eclipse.m2e</groupId>
           <artifactId>lifecycle-mapping</artifactId>
           <version>1.0.0</version>
           <configuration>
              <lifecycleMappingMetadata>
                 <pluginExecutions>
                    <pluginExecution>
                       <pluginExecutionFilter>
                          <groupId>
                             org.apache.maven.plugins
                          </groupId>
                          <artifactId>
                             maven-compiler-plugin
                          </artifactId>
                          <versionRange>
                             [@maven-compiler-plugin.version@,)
                          </versionRange>
                          <goals>
                             <goal>compile</goal>
                             <goal>testCompile</goal>
                          </goals>
                       </pluginExecutionFilter>
                       <action>
                          <ignore></ignore>
                       </action>
                    </pluginExecution>
                 </pluginExecutions>
              </lifecycleMappingMetadata>
           </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>

  <dependencies>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-core</artifactId>
      <version>${beam.version}</version>
    </dependency>


     <dependency>
      <groupId>com.google.cloud.dataflow</groupId>
      <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
      <version>2.5.0</version>
     </dependency>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
      <version>${beam.version}</version>
    </dependency>


    <!-- slf4j API frontend binding with JUL backend -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-jdk14</artifactId>
      <version>${slf4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.msgpack</groupId>
      <artifactId>msgpack-core</artifactId>
      <version>0.8.16</version>
    </dependency>
  </dependencies>
</project>

0 answers