我很难理解TextIO.write()的.withFileNamePolicy的概念。提供FileNamePolicy的要求似乎非常复杂,因为它可以像指定GCS存储桶来编写流式字段一样简单。
在较高的层面上,我将JSON消息流式传输到PubSub主题,并且我希望将这些原始消息写入GCS中的文件以进行永久存储(我还将对其进行其他处理)消息)。我最初开始使用这个Pipeline,认为这很简单:
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
.apply("Write to GCS", TextIO.write().to(gcs_bucket);
p.run();
}
我收到了需要WindowedWrites的错误,我申请了,然后需要FileNamePolicy。事情变得多毛了。
我去了梁文档并检查了FilenamePolicy。看起来我需要扩展这个类,然后还需要扩展其他抽象类来使其工作。不幸的是,关于Apache的文档有点不足,我找不到Dataflow 2.0这样做的任何示例,除了The Wordcount Example,它们甚至会在帮助器类中实现这些细节。
所以我可能只是通过复制大部分WordCount示例来完成这项工作,但我试图更好地理解这一点的细节。我有几个问题:
1)是否有任何路线图项目可以抽象出很多这种复杂性?看起来我应该像在非WindowsWrite中一样提供GCS存储桶,然后只提供一些基本选项,如时序和文件命名规则。我知道将流窗口数据写入文件比打开文件指针(或对象存储等效)更复杂。
2)看起来要做到这一点,我需要创建一个WindowedContext对象,它需要提供一个BoundedWindow抽象类,PaneInfo对象类,然后是一些分片信息。可用于这些的信息非常简单,我很难知道所有这些实际需要什么,特别是考虑到我的简单用例。有没有很好的例子可以实现这些?另外,看起来我还需要将#scilt设置为TextIO.write的一部分,但是还要将#shads作为fileNamePolicy的一部分提供?
感谢您帮助我理解这背后的细节,希望学到一些东西!
编辑7/20/17 所以我终于通过扩展FilenamePolicy来运行这个管道。我的挑战是需要从PubSub定义流数据的窗口。这是代码的非常接近的表示:
public class ReadData {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
.apply("Write to GCS", TextIO.write().to("gcs_bucket")
.withWindowedWrites()
.withFilenamePolicy(new TestPolicy())
.withNumShards(10));
p.run();
}
}
class TestPolicy extends FileBasedSink.FilenamePolicy {
@Override
public ResourceId windowedFilename(
ResourceId outputDirectory, WindowedContext context, String extension) {
IntervalWindow window = (IntervalWindow) context.getWindow();
String filename = String.format(
"%s-%s-%s-%s-of-%s.json",
"test",
window.start().toString(),
window.end().toString(),
context.getShardNumber(),
context.getShardNumber()
);
return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context context, String extension) {
throw new UnsupportedOperationException("Unsupported.");
}
}
答案 0 :(得分:6)
在Beam 2.0中,下面是将PubSub中的原始消息写入GCS上的窗口文件的示例。管道是相当可配置的,如果您想要数据的逻辑子部分以便于重新处理/存档,则允许您通过参数和子目录策略指定窗口持续时间。请注意,这对Apache Commons Lang 3具有额外的依赖性。
<强> PubSubToGcs 强>
/**
* This pipeline ingests incoming data from a Cloud Pub/Sub topic and
* outputs the raw data into windowed files at the specified output
* directory.
*/
public class PubsubToGcs {
/**
* Options supported by the pipeline.
*
* <p>Inherits standard configuration options.</p>
*/
public static interface Options extends DataflowPipelineOptions, StreamingOptions {
@Description("The Cloud Pub/Sub topic to read from.")
@Required
ValueProvider<String> getTopic();
void setTopic(ValueProvider<String> value);
@Description("The directory to output files to. Must end with a slash.")
@Required
ValueProvider<String> getOutputDirectory();
void setOutputDirectory(ValueProvider<String> value);
@Description("The filename prefix of the files to write to.")
@Default.String("output")
@Required
ValueProvider<String> getOutputFilenamePrefix();
void setOutputFilenamePrefix(ValueProvider<String> value);
@Description("The shard template of the output file. Specified as repeating sequences "
+ "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the "
+ "shard number, or number of shards respectively")
@Default.String("")
ValueProvider<String> getShardTemplate();
void setShardTemplate(ValueProvider<String> value);
@Description("The suffix of the files to write.")
@Default.String("")
ValueProvider<String> getOutputFilenameSuffix();
void setOutputFilenameSuffix(ValueProvider<String> value);
@Description("The sub-directory policy which files will use when output per window.")
@Default.Enum("NONE")
SubDirectoryPolicy getSubDirectoryPolicy();
void setSubDirectoryPolicy(SubDirectoryPolicy value);
@Description("The window duration in which data will be written. Defaults to 5m. "
+ "Allowed formats are: "
+ "Ns (for seconds, example: 5s), "
+ "Nm (for minutes, example: 12m), "
+ "Nh (for hours, example: 2h).")
@Default.String("5m")
String getWindowDuration();
void setWindowDuration(String value);
@Description("The maximum number of output shards produced when writing.")
@Default.Integer(10)
Integer getNumShards();
void setNumShards(Integer value);
}
/**
* Main entry point for executing the pipeline.
* @param args The command-line arguments to the pipeline.
*/
public static void main(String[] args) {
Options options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(Options.class);
run(options);
}
/**
* Runs the pipeline with the supplied options.
*
* @param options The execution parameters to the pipeline.
* @return The result of the pipeline execution.
*/
public static PipelineResult run(Options options) {
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);
/**
* Steps:
* 1) Read string messages from PubSub
* 2) Window the messages into minute intervals specified by the executor.
* 3) Output the windowed files to GCS
*/
pipeline
.apply("Read PubSub Events",
PubsubIO
.readStrings()
.fromTopic(options.getTopic()))
.apply(options.getWindowDuration() + " Window",
Window
.into(FixedWindows.of(parseDuration(options.getWindowDuration()))))
.apply("Write File(s)",
TextIO
.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(options.getOutputDirectory())
.withFilenamePolicy(
new WindowedFilenamePolicy(
options.getOutputFilenamePrefix(),
options.getShardTemplate(),
options.getOutputFilenameSuffix())
.withSubDirectoryPolicy(options.getSubDirectoryPolicy())));
// Execute the pipeline and return the result.
PipelineResult result = pipeline.run();
return result;
}
/**
* Parses a duration from a period formatted string. Values
* are accepted in the following formats:
* <p>
* Ns - Seconds. Example: 5s<br>
* Nm - Minutes. Example: 13m<br>
* Nh - Hours. Example: 2h
*
* <pre>
* parseDuration(null) = NullPointerException()
* parseDuration("") = Duration.standardSeconds(0)
* parseDuration("2s") = Duration.standardSeconds(2)
* parseDuration("5m") = Duration.standardMinutes(5)
* parseDuration("3h") = Duration.standardHours(3)
* </pre>
*
* @param value The period value to parse.
* @return The {@link Duration} parsed from the supplied period string.
*/
private static Duration parseDuration(String value) {
Preconditions.checkNotNull(value, "The specified duration must be a non-null value!");
PeriodParser parser = new PeriodFormatterBuilder()
.appendSeconds().appendSuffix("s")
.appendMinutes().appendSuffix("m")
.appendHours().appendSuffix("h")
.toParser();
MutablePeriod period = new MutablePeriod();
parser.parseInto(period, value, 0, Locale.getDefault());
Duration duration = period.toDurationFrom(new DateTime(0));
return duration;
}
}
的 WindowedFilenamePolicy 强>
/**
* The {@link WindowedFilenamePolicy} class will output files
* to the specified location with a format of output-yyyyMMdd'T'HHmmssZ-001-of-100.txt.
*/
@SuppressWarnings("serial")
public class WindowedFilenamePolicy extends FilenamePolicy {
/**
* Possible sub-directory creation modes.
*/
public static enum SubDirectoryPolicy {
NONE("."),
PER_HOUR("yyyy-MM-dd/HH"),
PER_DAY("yyyy-MM-dd");
private final String subDirectoryPattern;
private SubDirectoryPolicy(String subDirectoryPattern) {
this.subDirectoryPattern = subDirectoryPattern;
}
public String getSubDirectoryPattern() {
return subDirectoryPattern;
}
public String format(Instant instant) {
DateTimeFormatter formatter = DateTimeFormat.forPattern(subDirectoryPattern);
return formatter.print(instant);
}
}
/**
* The formatter used to format the window timestamp for outputting to the filename.
*/
private static final DateTimeFormatter formatter = ISODateTimeFormat
.basicDateTimeNoMillis()
.withZone(DateTimeZone.getDefault());
/**
* The filename prefix.
*/
private final ValueProvider<String> prefix;
/**
* The filenmae suffix.
*/
private final ValueProvider<String> suffix;
/**
* The shard template used during file formatting.
*/
private final ValueProvider<String> shardTemplate;
/**
* The policy which dictates when or if sub-directories are created
* for the windowed file output.
*/
private ValueProvider<SubDirectoryPolicy> subDirectoryPolicy = StaticValueProvider.of(SubDirectoryPolicy.NONE);
/**
* Constructs a new {@link WindowedFilenamePolicy} with the
* supplied prefix used for output files.
*
* @param prefix The prefix to append to all files output by the policy.
* @param shardTemplate The template used to create uniquely named sharded files.
* @param suffix The suffix to append to all files output by the policy.
*/
public WindowedFilenamePolicy(String prefix, String shardTemplate, String suffix) {
this(StaticValueProvider.of(prefix),
StaticValueProvider.of(shardTemplate),
StaticValueProvider.of(suffix));
}
/**
* Constructs a new {@link WindowedFilenamePolicy} with the
* supplied prefix used for output files.
*
* @param prefix The prefix to append to all files output by the policy.
* @param shardTemplate The template used to create uniquely named sharded files.
* @param suffix The suffix to append to all files output by the policy.
*/
public WindowedFilenamePolicy(
ValueProvider<String> prefix,
ValueProvider<String> shardTemplate,
ValueProvider<String> suffix) {
this.prefix = prefix;
this.shardTemplate = shardTemplate;
this.suffix = suffix;
}
/**
* The subdirectory policy will create sub-directories on the
* filesystem based on the window which has fired.
*
* @param policy The subdirectory policy to apply.
* @return The filename policy instance.
*/
public WindowedFilenamePolicy withSubDirectoryPolicy(SubDirectoryPolicy policy) {
return withSubDirectoryPolicy(StaticValueProvider.of(policy));
}
/**
* The subdirectory policy will create sub-directories on the
* filesystem based on the window which has fired.
*
* @param policy The subdirectory policy to apply.
* @return The filename policy instance.
*/
public WindowedFilenamePolicy withSubDirectoryPolicy(ValueProvider<SubDirectoryPolicy> policy) {
this.subDirectoryPolicy = policy;
return this;
}
/**
* The windowed filename method will construct filenames per window in the
* format of output-yyyyMMdd'T'HHmmss-001-of-100.txt.
*/
@Override
public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext c, String extension) {
Instant windowInstant = c.getWindow().maxTimestamp();
String datetimeStr = formatter.print(windowInstant.toDateTime());
// Remove the prefix when it is null so we don't append the literal 'null'
// to the start of the filename
String filenamePrefix = prefix.get() == null ? datetimeStr : prefix.get() + "-" + datetimeStr;
String filename = DefaultFilenamePolicy.constructName(
filenamePrefix,
shardTemplate.get(),
StringUtils.defaultIfBlank(suffix.get(), extension), // Ignore the extension in favor of the suffix.
c.getShardNumber(),
c.getNumShards());
String subDirectory = subDirectoryPolicy.get().format(windowInstant);
return outputDirectory
.resolve(subDirectory, StandardResolveOptions.RESOLVE_DIRECTORY)
.resolve(filename, StandardResolveOptions.RESOLVE_FILE);
}
/**
* Unwindowed writes are unsupported by this filename policy so an {@link UnsupportedOperationException}
* will be thrown if invoked.
*/
@Override
public ResourceId unwindowedFilename(ResourceId outputDirectory, Context c, String extension) {
throw new UnsupportedOperationException("There is no windowed filename policy for unwindowed file"
+ " output. Please use the WindowedFilenamePolicy with windowed writes or switch filename policies.");
}
}
答案 1 :(得分:1)
在Beam中,DefaultFilenamePolicy目前支持窗口写入,因此无需编写自定义FilenamePolicy。您可以通过在文件名模板中放置W和P占位符(分别用于窗口和窗格)来控制输出文件名。这存在于头梁存储库中,也将出现在即将发布的Beam 2.1版本中(正如我们所说的那样发布)。