Apache Beam: 'Expect srcResourceIds and destResourceIds have the same scheme, but received hdfs, filename'

Asked: 2017-06-30 12:29:57

Tags: apache-spark hdfs apache-beam

I want to run a pipeline with the Spark runner, with the data stored on a remote machine. The following command was used to submit the job:

    ./spark-submit \
      --class org.apache.beam.examples.WindowedWordCount \
      --master spark://192.168.1.214:6066 \
      --deploy-mode cluster \
      --supervise \
      --executor-memory 2G \
      --total-executor-cores 4 \
      hdfs://192.168.1.214:9000/input/word-count-ck-0.1.jar \
      --runner=SparkRunner

It creates a temporary directory '.temp-beam-2017-07-184_19-10-19-0' under the output location, but when it finalizes the write operation it throws an IllegalArgumentException (see the log below):

    17/07/04 00:40:29 INFO Executor: Adding file:/usr/local/spark/spark-1.6.3-bin-hadoop2.6/work/app-20170704004020-0000/1/./word-count-ck-0.1.jar to class loader
    17/07/04 00:40:29 INFO TorrentBroadcast: Started reading broadcast variable 0
    17/07/04 00:40:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.4 KB, free 1247.2 MB)
    17/07/04 00:40:29 INFO TorrentBroadcast: Reading broadcast variable 0 took 102 ms
    17/07/04 00:40:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 15.6 KB, free 1247.2 MB)
    17/07/04 00:40:30 INFO CacheManager: Partition rdd_0_1 not found, computing it
    17/07/04 00:40:31 INFO MemoryStore: Block rdd_0_1 stored as values in memory (estimated size 292.9 KB, free 1246.9 MB)
    17/07/04 00:40:33 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 9074 bytes result sent to driver
    17/07/04 00:40:34 INFO CoarseGrainedExecutorBackend: Got assigned task 2
    17/07/04 00:40:34 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
    17/07/04 00:40:34 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
    17/07/04 00:40:34 INFO TorrentBroadcast: Started reading broadcast variable 1
    17/07/04 00:40:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.4 KB, free 1246.9 MB)
    17/07/04 00:40:34 INFO TorrentBroadcast: Reading broadcast variable 1 took 76 ms
    17/07/04 00:40:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 18.4 KB, free 1246.9 MB)
    17/07/04 00:40:34 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
    17/07/04 00:40:34 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.1.214:35429)
    17/07/04 00:40:34 INFO MapOutputTrackerWorker: Got the output locations
    17/07/04 00:40:34 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
    17/07/04 00:40:34 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 11 ms
    17/07/04 00:40:34 INFO WriteFiles: Opening writer 59cdfe11-3fff-4188-b9ca-17fce87d3ee2 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:30:00.000Z..2017-07-03T19:40:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer f69da197-b163-4eee-8456-12cd7435ba8d for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:50:00.000Z..2017-07-03T20:00:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer 99089ec2-d54f-492c-8dc6-7dbdb0d6ab8d for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:10:00.000Z..2017-07-03T19:20:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer a5f369b3-355b-4db7-b894-c8b35c20b274 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:20:00.000Z..2017-07-03T19:30:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer bd717b0e-af3c-461c-a554-3b472369bc84 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:40:00.000Z..2017-07-03T19:50:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer f0135931-d310-4e31-8b58-caa752e75e6b for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T20:00:00.000Z..2017-07-03T20:10:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:34 INFO WriteFiles: Opening writer e6a9e90d-ef5d-46d2-8fc2-8b374727e63e for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T20:10:00.000Z..2017-07-03T20:20:00.000Z) pane PaneInfo.NO_FIRING
    17/07/04 00:40:35 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 6819 bytes result sent to driver
    17/07/04 00:40:35 INFO CoarseGrainedExecutorBackend: Got assigned task 5
    17/07/04 00:40:35 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
    17/07/04 00:40:35 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
    17/07/04 00:40:35 INFO TorrentBroadcast: Started reading broadcast variable 2
    17/07/04 00:40:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.8 KB, free 1246.9 MB)
    17/07/04 00:40:35 INFO TorrentBroadcast: Reading broadcast variable 2 took 10 ms
    17/07/04 00:40:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 22.2 KB, free 1246.9 MB)
    17/07/04 00:40:35 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
    17/07/04 00:40:35 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.1.214:35429)
    17/07/04 00:40:35 INFO MapOutputTrackerWorker: Got the output locations
    17/07/04 00:40:35 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
    17/07/04 00:40:35 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 2 ms
    17/07/04 00:40:36 INFO WriteFiles: Finalizing write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}.
    17/07/04 00:40:36 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 5)
    org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received hdfs, ck2-19.
        at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
        at org.apache.beam.sdk.io.WriteFiles$1$auxiliary$LSdUfeyo.invokeProcessElement(Unknown Source)
        at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:197)
        at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:158)
        at org.apache.beam.runners.spark.translation.DoFnRunnerWithMetrics.processElement(DoFnRunnerWithMetrics.java:64)
        at org.apache.beam.runners.spark.translation.SparkProcessContext$ProcCtxtIterator.computeNext(SparkProcessContext.java:165)
        at org.apache.beam.runners.spark.repackaged.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
        at org.apache.beam.runners.spark.repackaged.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received hdfs, ck2-19.
        at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
        at org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:398)
        at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:240)
        at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:641)
        at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:529)
        at org.apache.beam.sdk.io.WriteFiles$1.processElement(WriteFiles.java:539)

Below are the plugins and dependencies I am using in my project:

<packaging>jar</packaging>

    <properties>
        <beam.version>2.0.0</beam.version>
        <surefire-plugin.version>2.20</surefire-plugin.version>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-spark</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
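        <!-- Beam's HDFS filesystem module: provides HadoopFileSystem, which backs the
             hdfs:// scheme used by the pipeline's input and output paths. -->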
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.6.3</version>
            <scope>runtime</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>jul-to-slf4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-flink_2.10</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-scala_2.10</artifactId>
            <version>2.8.8</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>${beam.version}</version>
    <!--         <exclusions>
            <exclusion>
            <artifactId>beam-sdks-java-core</artifactId>
            </exclusion>
            </exclusions> -->
        </dependency>

        <!-- Adds a dependency on the Beam Google Cloud Platform IO module. -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
            <version>${beam.version}</version>
        </dependency>

        <!-- Dependencies below this line are specific dependencies needed by the examples code. -->
        <dependency>
            <groupId>com.google.api-client</groupId>
            <artifactId>google-api-client</artifactId>
            <version>1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-bigquery</artifactId>
            <version>v2-rev295-1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.http-client</groupId>
            <artifactId>google-http-client</artifactId>
            <version>1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-pubsub</artifactId>
            <version>v1-rev10-1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.4</version>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>20.0</version>
        </dependency>

        <!-- Add slf4j API frontend binding with JUL backend -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.14</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
            <version>1.7.14</version>
            <!-- When loaded at runtime this will wire up slf4j to the JUL backend -->
            <scope>runtime</scope>
        </dependency>

        <!-- Hamcrest and JUnit are required dependencies of PAssert,
             which is used in the main code of DebuggingWordCount example. -->
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-all</artifactId>
            <version>1.3</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-common</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hadoop-input-format</artifactId>
            <version>${beam.version}</version>
        </dependency>

        <!-- The DirectRunner is needed for unit tests. -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>${beam.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0-alpha2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>${surefire-plugin.version}</version>
                <configuration>
                    <parallel>all</parallel>
                    <threadCount>4</threadCount>
                    <redirectTestOutputToFile>true</redirectTestOutputToFile>
                </configuration>
                <dependencies>
                    <dependency>
                        <groupId>org.apache.maven.surefire</groupId>
                        <artifactId>surefire-junit47</artifactId>
                        <version>${surefire-plugin.version}</version>
                    </dependency>
                </dependencies>
            </plugin>

            <!-- Ensure that the Maven jar plugin runs before the Maven
              shade plugin by listing the plugin higher within the file. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
            </plugin>



            <!--
              Configures `mvn package` to produce a bundled jar ("fat jar") for runners
              that require this for job submission to a cluster.
            -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/LICENSE</exclude>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.codehaus.mojo</groupId>
                    <artifactId>exec-maven-plugin</artifactId>
                    <version>1.4.0</version>
                    <configuration>
                        <cleanupDaemonThreads>false</cleanupDaemonThreads>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
    </project>

Below is the source code of the WindowedWordCount class:

    package org.apache.beam.examples;
    import java.io.IOException;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.examples.common.ExampleBigQueryTableOptions;
    import org.apache.beam.examples.common.ExampleOptions;
    import org.apache.beam.examples.common.WriteOneFilePerWindow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.DefaultValueFactory;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    public class WindowedWordCount {
        static final int WINDOW_SIZE = 10;  // Default window duration in minutes

      static class AddTimestampFn extends DoFn<String, String> {
        private static final Duration RAND_RANGE = Duration.standardHours(1);
        private final Instant minTimestamp;
        private final Instant maxTimestamp;

        AddTimestampFn(Instant minTimestamp, Instant maxTimestamp) {
          this.minTimestamp = minTimestamp;
          this.maxTimestamp = maxTimestamp;
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
          Instant randomTimestamp =
              new Instant(
                  ThreadLocalRandom.current()
                      .nextLong(minTimestamp.getMillis(), maxTimestamp.getMillis()));

          /**
           * Concept #2: Set the data element with that timestamp.
           */
          c.outputWithTimestamp(c.element(), randomTimestamp);
        }
      }

      /** A {@link DefaultValueFactory} that returns the current system time. */
      public static class DefaultToCurrentSystemTime implements DefaultValueFactory<Long> {
       // @Override
        public Long create(PipelineOptions options) {
          return System.currentTimeMillis();
        }
      }

      /** A {@link DefaultValueFactory} that returns the minimum timestamp plus one hour. */
      public static class DefaultToMinTimestampPlusOneHour implements DefaultValueFactory<Long> {
       // @Override
        public Long create(PipelineOptions options) {
          return options.as(Options.class).getMinTimestampMillis()
              + Duration.standardHours(1).getMillis();
        }
      }

      /**
       * Options for {@link WindowedWordCount}.
       *
       * <p>Inherits standard example configuration options, which allow specification of the
       * runner, as well as the {@link WordCount.WordCountOptions} support for
       * specification of the input and output files.
       */
      public interface Options extends WordCount.WordCountOptions,
          ExampleOptions, ExampleBigQueryTableOptions {
        @Description("Fixed window duration, in minutes")
        @Default.Integer(WINDOW_SIZE)
        Integer getWindowSize();
        void setWindowSize(Integer value);

        @Description("Minimum randomly assigned timestamp, in milliseconds-since-epoch")
        @Default.InstanceFactory(DefaultToCurrentSystemTime.class)
        Long getMinTimestampMillis();
        void setMinTimestampMillis(Long value);

        @Description("Maximum randomly assigned timestamp, in milliseconds-since-epoch")
        @Default.InstanceFactory(DefaultToMinTimestampPlusOneHour.class)
        Long getMaxTimestampMillis();
        void setMaxTimestampMillis(Long value);

        @Description("Fixed number of shards to produce per window, or null for runner-chosen sharding")
        Integer getNumShards();
        void setNumShards(Integer numShards);
      }

      public static void main(String[] args) throws IOException {
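        // Note: the args passed in from spark-submit are ignored here; the pipeline always
        // runs with the hard-coded --hdfsConfiguration and --runner values in args1 below.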
        String[] args1 = new String[] {"--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://192.168.1.214:9000\"}]", "--runner=SparkRunner"};
        Options options = PipelineOptionsFactory.fromArgs(args1).withValidation().as(Options.class);
        final String output = options.getOutput();
        final Instant minTimestamp = new Instant(options.getMinTimestampMillis());
        final Instant maxTimestamp = new Instant(options.getMaxTimestampMillis());

        Pipeline pipeline = Pipeline.create(options);

        /**
         * Concept #1: the Beam SDK lets us run the same pipeline with either a bounded or
         * unbounded input source.
         */
        PCollection<String> input = pipeline
          /** Read from the GCS file. */
          .apply(TextIO.read().from(options.getInputFile()))
          // Concept #2: Add an element timestamp, using an artificial time just to show windowing.
          // See AddTimestampFn for more detail on this.
          .apply(ParDo.of(new AddTimestampFn(minTimestamp, maxTimestamp)));

        /**
         * Concept #3: Window into fixed windows. The fixed window size for this example defaults to
         * 10 minutes (you can change this with a command-line option). See the documentation for more
         * information on how fixed windows work, and for information on the other types of windowing
         * available (e.g., sliding windows).
         */
        PCollection<String> windowedWords =
            input.apply(
                Window.<String>into(
                    FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));

        /**
         * Concept #4: Re-use our existing CountWords transform that does not have knowledge of
         * windows over a PCollection containing windowed values.
         */
        PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new WordCount.CountWords());

        /**
         * Concept #5: Format the results and write to a sharded file partitioned by window, using a
         * simple ParDo operation. Because there may be failures followed by retries, the
         * writes must be idempotent, but the details of writing to files is elided here.
         */
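        // WriteOneFilePerWindow (from the Beam examples common package) writes one text file
        // (or shard set) per window under the path supplied via --output.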
        wordCounts
            .apply(MapElements.via(new WordCount.FormatAsTextFn()))
            .apply(new WriteOneFilePerWindow(output, options.getNumShards()));

        PipelineResult result = pipeline.run();
        try {
          result.waitUntilFinish();
        } catch (Exception exc) {
          result.cancel();
        }
      }

    }
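
For context on the hard-coded `--hdfsConfiguration` argument in `main()`: it populates Beam's `HadoopFileSystemOptions`, which is how the `hdfs://` scheme gets registered with Beam's `FileSystems`. Below is a minimal sketch of the equivalent programmatic setup, assuming Beam 2.0.0 with `beam-sdks-java-io-hadoop-file-system` on the classpath (the class name `HdfsOptionsSketch` is made up for illustration, not part of the posted code):

    // Hypothetical sketch: the programmatic equivalent of the --hdfsConfiguration
    // JSON string used in main() above.
    import java.util.Collections;

    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    public class HdfsOptionsSketch {
      public static void main(String[] args) {
        // Point Beam's Hadoop-backed FileSystem at the remote namenode.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.1.214:9000");

        HadoopFileSystemOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(conf));

        // The same options object can then be viewed as the pipeline's own options
        // interface via options.as(...) and handed to Pipeline.create(options).
      }
    }
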

0 Answers