After reading this question, I still have some questions about how Dataflow / Apache Beam distributes the workload. The problem I ran into can be demonstrated with the following code:
package debug;
import java.io.IOException;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
public class DebugPipeline {
    @SuppressWarnings("serial")
    public static void main(String[] args) throws IOException {
        /*******************************************
         * SETUP - Build options.
         ********************************************/
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setAutoscalingAlgorithm(
                DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
        // Autoscaling will scale between n/15 and n workers, so from 1-15 here
        options.setMaxNumWorkers(15);
        // Default of 250GB is absurdly high and we don't need that much on every worker
        options.setDiskSizeGb(32);
        // Manually configure scaling (i.e. 1 vs 5 for comparison)
        options.setNumWorkers(5);

        // Debug Pipeline
        Pipeline pipeline = Pipeline.create(options);
        pipeline
                .apply(PubsubIO.readStrings()
                        .fromSubscription("your subscription"))

                // this is the transform that I actually care about. In production code, this will
                // send a REST request to some 3rd party endpoint.
                .apply("sleep", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws InterruptedException {
                        Thread.sleep(500);
                        c.output(c.element());
                    }
                }));

        pipeline.run();
    }
}
Comparing the maximum throughput when using 1 worker versus 5 workers, the latter is not 5 times as efficient but only marginally more efficient. This makes me wonder about the following:

Running the pipeline produces an asynchronous "job". Does this mean that each DoFn instance is also processed asynchronously?

In production code, the Thread.sleep will be replaced with a synchronous HTTP request to a third-party API (a rough sketch of that transform follows below). Does asynchronous processing mean that it turns synchronous clients into asynchronous clients?
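For reference, this is roughly what I have in mind for the production transform with a blocking client. The endpoint URL and the use of plain HttpURLConnection are placeholders for illustration only; the real third-party client is not shown here:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative stand-in for the production transform: one synchronous (blocking)
// HTTP call per element. The endpoint URL is hypothetical.
public class CallThirdPartyApiFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        // The worker thread executing this bundle is blocked for the full round-trip,
        // just like Thread.sleep(500) in the debug pipeline above.
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://thirdparty.example.com/endpoint").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.getOutputStream().write(c.element().getBytes(StandardCharsets.UTF_8));
        int status = conn.getResponseCode(); // blocks until the response arrives
        conn.disconnect();
        c.output(c.element() + " -> " + status);
    }
}

It would replace the "sleep" transform above, e.g. .apply("callApi", ParDo.of(new CallThirdPartyApiFn())).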
Update

One additional question: the Dataflow documentation has this comment about PubSubIO:

In extreme cases (e.g. Cloud Pub/Sub subscriptions with large publishing batches or sinks with very high latency), autoscaling is known to become coarse-grained.

Could you expand on the following:

What does a large publishing batch mean? i.e. a large batch size or a large number of batches? (My current reading of publisher-side batching is sketched after these questions.)

Does a high-latency sink include high latency in the transforms that come before the sink?

What is the coarse-grained behavior?
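For concreteness, this is how I currently read "publishing batch": the publisher-side batching settings of the Cloud Pub/Sub Java client, where many messages are buffered and sent in a single publish request. The project name, topic name, and thresholds below are made up for illustration, and I am not sure this is what the documentation refers to:

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class LargeBatchPublisher {
    public static void main(String[] args) throws Exception {
        // Hypothetical project/topic names and thresholds, purely for illustration.
        TopicName topic = TopicName.of("my-project", "my-topic");

        // "Large publishing batch" read as publisher-side batching: many messages
        // (and/or many bytes) are buffered and sent to Pub/Sub in one publish request.
        BatchingSettings batching = BatchingSettings.newBuilder()
                .setElementCountThreshold(1000L)      // up to 1000 messages per request
                .setRequestByteThreshold(5_000_000L)  // or up to ~5 MB per request
                .setDelayThreshold(Duration.ofMillis(100))
                .build();

        Publisher publisher = Publisher.newBuilder(topic)
                .setBatchingSettings(batching)
                .build();
        try {
            for (int i = 0; i < 10_000; i++) {
                publisher.publish(PubsubMessage.newBuilder()
                        .setData(ByteString.copyFromUtf8("message-" + i))
                        .build());
            }
        } finally {
            publisher.shutdown();
        }
    }
}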
Answer 0 (score: 1)