I want to run a pipeline with the Spark runner, with the data stored on a remote machine. The job was submitted with the following command:
./spark-submit --class org.apache.beam.examples.WindowedWordCount --master spark://192.168.1.214:6066 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 4 hdfs://192.168.1.214:9000/input/word-count-ck-0.1.jar --runner=SparkRunner
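For reference, everything after the jar path (here only --runner=SparkRunner) is handed to the example's main() as plain application arguments; the Beam examples normally parse them with PipelineOptionsFactory. A minimal sketch of that parsing step, assuming the Spark runner is on the classpath so the --runner flag can resolve (this is not the exact example code):

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ParseArgsSketch {
  public static void main(String[] args) {
    // With the spark-submit command above, args would be {"--runner=SparkRunner"}.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    System.out.println("Configured runner: " + options.getRunner());
  }
}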
The job creates a temporary directory '.temp-beam-2017-07-184_19-10-19-0' under the output location, but it throws an IllegalArgumentException while finalizing the write operation (see the log below):
17/07/04 00:40:29 INFO Executor: Adding file:/usr/local/spark/spark-1.6.3-bin-hadoop2.6/work/app-20170704004020-0000/1/./word-count-ck-0.1.jar to class loader
17/07/04 00:40:29 INFO TorrentBroadcast: Started reading broadcast variable 0
17/07/04 00:40:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.4 KB, free 1247.2 MB)
17/07/04 00:40:29 INFO TorrentBroadcast: Reading broadcast variable 0 took 102 ms
17/07/04 00:40:30 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 15.6 KB, free 1247.2 MB)
17/07/04 00:40:30 INFO CacheManager: Partition rdd_0_1 not found, computing it
17/07/04 00:40:31 INFO MemoryStore: Block rdd_0_1 stored as values in memory (estimated size 292.9 KB, free 1246.9 MB)
17/07/04 00:40:33 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 9074 bytes result sent to driver
17/07/04 00:40:34 INFO CoarseGrainedExecutorBackend: Got assigned task 2
17/07/04 00:40:34 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/07/04 00:40:34 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
17/07/04 00:40:34 INFO TorrentBroadcast: Started reading broadcast variable 1
17/07/04 00:40:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.4 KB, free 1246.9 MB)
17/07/04 00:40:34 INFO TorrentBroadcast: Reading broadcast variable 1 took 76 ms
17/07/04 00:40:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 18.4 KB, free 1246.9 MB)
17/07/04 00:40:34 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
17/07/04 00:40:34 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.1.214:35429)
17/07/04 00:40:34 INFO MapOutputTrackerWorker: Got the output locations
17/07/04 00:40:34 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/07/04 00:40:34 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 11 ms
17/07/04 00:40:34 INFO WriteFiles: Opening writer 59cdfe11-3fff-4188-b9ca-17fce87d3ee2 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:30:00.000Z..2017-07-03T19:40:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer f69da197-b163-4eee-8456-12cd7435ba8d for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:50:00.000Z..2017-07-03T20:00:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer 99089ec2-d54f-492c-8dc6-7dbdb0d6ab8d for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:10:00.000Z..2017-07-03T19:20:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer a5f369b3-355b-4db7-b894-c8b35c20b274 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:20:00.000Z..2017-07-03T19:30:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer bd717b0e-af3c-461c-a554-3b472369bc84 for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T19:40:00.000Z..2017-07-03T19:50:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer f0135931-d310-4e31-8b58-caa752e75e6b for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T20:00:00.000Z..2017-07-03T20:10:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:34 INFO WriteFiles: Opening writer e6a9e90d-ef5d-46d2-8fc2-8b374727e63e for write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}, window [2017-07-03T20:10:00.000Z..2017-07-03T20:20:00.000Z) pane PaneInfo.NO_FIRING
17/07/04 00:40:35 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 6819 bytes result sent to driver
17/07/04 00:40:35 INFO CoarseGrainedExecutorBackend: Got assigned task 5
17/07/04 00:40:35 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
17/07/04 00:40:35 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
17/07/04 00:40:35 INFO TorrentBroadcast: Started reading broadcast variable 2
17/07/04 00:40:35 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.8 KB, free 1246.9 MB)
17/07/04 00:40:35 INFO TorrentBroadcast: Reading broadcast variable 2 took 10 ms
17/07/04 00:40:35 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 22.2 KB, free 1246.9 MB)
17/07/04 00:40:35 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
17/07/04 00:40:35 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.1.214:35429)
17/07/04 00:40:35 INFO MapOutputTrackerWorker: Got the output locations
17/07/04 00:40:35 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/07/04 00:40:35 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 2 ms
17/07/04 00:40:36 INFO WriteFiles: Finalizing write operation TextWriteOperation{tempDirectory=hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/, windowedWrites=true}.
17/07/04 00:40:36 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 5)
org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received hdfs, ck2-19.
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
at org.apache.beam.sdk.io.WriteFiles$1$auxiliary$LSdUfeyo.invokeProcessElement(Unknown Source)
at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:197)
at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:158)
at org.apache.beam.runners.spark.translation.DoFnRunnerWithMetrics.processElement(DoFnRunnerWithMetrics.java:64)
at org.apache.beam.runners.spark.translation.SparkProcessContext$ProcCtxtIterator.computeNext(SparkProcessContext.java:165)
at org.apache.beam.runners.spark.repackaged.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.beam.runners.spark.repackaged.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received hdfs, ck2-19.
at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:398)
at org.apache.beam.sdk.io.FileSystems.copy(FileSystems.java:240)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.copyToOutputFiles(FileBasedSink.java:641)
at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalize(FileBasedSink.java:529)
at org.apache.beam.sdk.io.WriteFiles$1.processElement(WriteFiles.java:539)
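If I read the stack trace correctly, FileBasedSink$WriteOperation.copyToOutputFiles calls FileSystems.copy, and validateSrcDestLists requires the temporary files and the final destination to resolve to the same URI scheme; here the temp directory is hdfs:// while the destination apparently resolves to the scheme 'ck2-19'. Beam compares schemes via ResourceId.getScheme(), but a plain java.net.URI check illustrates what "same scheme" means (a sketch with my own paths, only for illustration):

import java.net.URI;

public class SchemeCheck {
  public static void main(String[] args) {
    // The temp directory from the log vs. a destination spelled with an explicit hdfs:// prefix.
    String temp = "hdfs://192.168.1.214:9000/beamWorks/ckoutput/.temp-beam-2017-07-184_19-10-19-0/";
    String dest = "hdfs://192.168.1.214:9000/beamWorks/ckoutput/";
    // Both print "hdfs"; a destination spec without the hdfs:// prefix would report a
    // different scheme and trip the precondition shown in the trace above.
    System.out.println(URI.create(temp).getScheme());
    System.out.println(URI.create(dest).getScheme());
  }
}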
Here are the plugins and dependencies used in my project:
<packaging>jar</packaging>
<properties>
<beam.version>2.0.0</beam.version>
<surefire-plugin.version>2.20</surefire-plugin.version>
</properties>
<repositories>
<repository>
<id>apache.snapshots</id>
<name>Apache Development Snapshot Repository</name>
<url>https://repository.apache.org/content/repositories/snapshots/</url>
<releases>
<enabled>false</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-spark</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.3</version>
<scope>runtime</scope>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>jul-to-slf4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-flink_2.10</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.10</artifactId>
<version>2.8.8</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-core</artifactId>
<version>${beam.version}</version>
<!-- <exclusions>
<exclusion>
<artifactId>beam-sdks-java-core</artifactId>
</exclusion>
</exclusions> -->
</dependency>
<!-- Adds a dependency on the Beam Google Cloud Platform IO module. -->
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
<version>${beam.version}</version>
</dependency>
<!-- Dependencies below this line are specific dependencies needed by the examples code. -->
<dependency>
<groupId>com.google.api-client</groupId>
<artifactId>google-api-client</artifactId>
<version>1.22.0</version>
<exclusions>
<!-- Exclude an old version of guava that is being pulled
in by a transitive dependency of google-api-client -->
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava-jdk5</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.google.apis</groupId>
<artifactId>google-api-services-bigquery</artifactId>
<version>v2-rev295-1.22.0</version>
<exclusions>
<!-- Exclude an old version of guava that is being pulled
in by a transitive dependency of google-api-client -->
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava-jdk5</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.google.http-client</groupId>
<artifactId>google-http-client</artifactId>
<version>1.22.0</version>
<exclusions>
<!-- Exclude an old version of guava that is being pulled
in by a transitive dependency of google-api-client -->
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava-jdk5</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.google.apis</groupId>
<artifactId>google-api-services-pubsub</artifactId>
<version>v1-rev10-1.22.0</version>
<exclusions>
<!-- Exclude an old version of guava that is being pulled
in by a transitive dependency of google-api-client -->
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava-jdk5</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>20.0</version>
</dependency>
<!-- Add slf4j API frontend binding with JUL backend -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.14</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-jdk14</artifactId>
<version>1.7.14</version>
<!-- When loaded at runtime this will wire up slf4j to the JUL backend -->
<scope>runtime</scope>
</dependency>
<!-- Hamcrest and JUnit are required dependencies of PAssert,
which is used in the main code of DebuggingWordCount example. -->
<dependency>
<groupId>org.hamcrest</groupId>
<artifactId>hamcrest-all</artifactId>
<version>1.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-hadoop-common</artifactId>
<version>${beam.version}</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-hadoop-file-system</artifactId>
<version>${beam.version}</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-hadoop-input-format</artifactId>
<version>${beam.version}</version>
</dependency>
<!-- The DirectRunner is needed for unit tests. -->
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-direct-java</artifactId>
<version>${beam.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.0.0-alpha2</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>${surefire-plugin.version}</version>
<configuration>
<parallel>all</parallel>
<threadCount>4</threadCount>
<redirectTestOutputToFile>true</redirectTestOutputToFile>
</configuration>
<dependencies>
<dependency>
<groupId>org.apache.maven.surefire</groupId>
<artifactId>surefire-junit47</artifactId>
<version>${surefire-plugin.version}</version>
</dependency>
</dependencies>
</plugin>
<!-- Ensure that the Maven jar plugin runs before the Maven
shade plugin by listing the plugin higher within the file. -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
</plugin>
<!--
Configures `mvn package` to produce a bundled jar ("fat jar") for runners
that require this for job submission to a cluster.
-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/LICENSE</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.4.0</version>
<configuration>
<cleanupDaemonThreads>false</cleanupDaemonThreads>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
Here is the source code of the WindowedWordCount class:
package org.apache.beam.examples;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.examples.common.ExampleBigQueryTableOptions;
import org.apache.beam.examples.common.ExampleOptions;
import org.apache.beam.examples.common.WriteOneFilePerWindow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.DefaultValueFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;
public class WindowedWordCount {
static final int WINDOW_SIZE = 10; // Default window duration in minutes
static class AddTimestampFn extends DoFn<String, String> {
private static final Duration RAND_RANGE = Duration.standardHours(1);
private final Instant minTimestamp;
private final Instant maxTimestamp;
AddTimestampFn(Instant minTimestamp, Instant maxTimestamp) {
this.minTimestamp = minTimestamp;
this.maxTimestamp = maxTimestamp;
}
@ProcessElement
public void processElement(ProcessContext c) {
Instant randomTimestamp =
new Instant(
ThreadLocalRandom.current()
.nextLong(minTimestamp.getMillis(), maxTimestamp.getMillis()));
/**
* Concept #2: Set the data element with that timestamp.
*/
c.outputWithTimestamp(c.element(), new Instant(randomTimestamp));
}
}
/** A {@link DefaultValueFactory} that returns the current system time. */
public static class DefaultToCurrentSystemTime implements DefaultValueFactory<Long> {
// @Override
public Long create(PipelineOptions options) {
return System.currentTimeMillis();
}
}
/** A {@link DefaultValueFactory} that returns the minimum timestamp plus one hour. */
public static class DefaultToMinTimestampPlusOneHour implements DefaultValueFactory<Long> {
// @Override
public Long create(PipelineOptions options) {
return options.as(Options.class).getMinTimestampMillis()
+ Duration.standardHours(1).getMillis();
}
}
/**
* Options for {@link WindowedWordCount}.
*
* <p>Inherits standard example configuration options, which allow specification of the
* runner, as well as the {@link WordCount.WordCountOptions} support for
* specification of the input and output files.
*/
public interface Options extends WordCount.WordCountOptions,
ExampleOptions, ExampleBigQueryTableOptions {
@Description("Fixed window duration, in minutes")
@Default.Integer(WINDOW_SIZE)
Integer getWindowSize();
void setWindowSize(Integer value);
@Description("Minimum randomly assigned timestamp, in milliseconds-since-epoch")
@Default.InstanceFactory(DefaultToCurrentSystemTime.class)
Long getMinTimestampMillis();
void setMinTimestampMillis(Long value);
@Description("Maximum randomly assigned timestamp, in milliseconds-since-epoch")
@Default.InstanceFactory(DefaultToMinTimestampPlusOneHour.class)
Long getMaxTimestampMillis();
void setMaxTimestampMillis(Long value);
@Description("Fixed number of shards to produce per window, or null for runner-chosen sharding")
Integer getNumShards();
void setNumShards(Integer numShards);
}
public static void main(String[] args) throws IOException {
String[] args1 =new String[]{ "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://192.168.1.214:9000\"}]","--runner=SparkRunner"};
Options options = PipelineOptionsFactory.fromArgs(args1).withValidation().as(Options.class);
final String output = options.getOutput();
final Instant minTimestamp = new Instant(options.getMinTimestampMillis());
final Instant maxTimestamp = new Instant(options.getMaxTimestampMillis());
Pipeline pipeline = Pipeline.create(options);
/**
* Concept #1: the Beam SDK lets us run the same pipeline with either a bounded or
* unbounded input source.
*/
PCollection<String> input = pipeline
/** Read from the GCS file. */
.apply(TextIO.read().from(options.getInputFile()))
// Concept #2: Add an element timestamp, using an artificial time just to show windowing.
// See AddTimestampFn for more detail on this.
.apply(ParDo.of(new AddTimestampFn(minTimestamp, maxTimestamp)));
/**
* Concept #3: Window into fixed windows. The fixed window size for this example defaults to 10
* minutes (you can change this with a command-line option). See the documentation for more
* information on how fixed windows work, and for information on the other types of windowing
* available (e.g., sliding windows).
*/
PCollection<String> windowedWords =
input.apply(
Window.<String>into(
FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
/**
* Concept #4: Re-use our existing CountWords transform that does not have knowledge of
* windows over a PCollection containing windowed values.
*/
PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new WordCount.CountWords());
/**
* Concept #5: Format the results and write to a sharded file partitioned by window, using a
* simple ParDo operation. Because there may be failures followed by retries, the
* writes must be idempotent, but the details of writing to files is elided here.
*/
wordCounts
.apply(MapElements.via(new WordCount.FormatAsTextFn()))
.apply(new WriteOneFilePerWindow(output, options.getNumShards()));
PipelineResult result = pipeline.run();
try {
result.waitUntilFinish();
} catch (Exception exc) {
result.cancel();
}
}
}
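For completeness, my understanding of the --hdfsConfiguration argument hard-coded in main() above: it maps onto HadoopFileSystemOptions from beam-sdks-java-io-hadoop-file-system, which builds the hdfs:// filesystem from a list of Hadoop Configuration objects. A rough programmatic equivalent (a sketch only, I have not submitted the job this way):

import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
  public static void main(String[] args) {
    // Equivalent of --hdfsConfiguration=[{"fs.defaultFS" : "hdfs://192.168.1.214:9000"}]
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://192.168.1.214:9000");

    HadoopFileSystemOptions options = PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
    options.setHdfsConfiguration(Collections.singletonList(conf));
    // The runner is then expected to register the hdfs:// filesystem from these options
    // when the pipeline is created and run.
  }
}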