I'm working on an Apache Beam streaming pipeline. It reads from many Kafka topics and writes all of the data to GCS.
My KafkaIO reader is:
KafkaIO.<String, AvroGenericRecord>read()
    .withBootstrapServers(bootstrapServers)
    .withConsumerConfigUpdates(configUpdates)
    .withTopics(inputTopics)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
    .withMaxNumRecords(maxNumRecords)
    .commitOffsetsInFinalize()
    .withoutMetadata();
In configUpdates I set the ConsumerConfig.GROUP_ID_CONFIG value.
I would like to somehow read with 2-3 consumer groups. Is that possible? I ask because some of my topics receive data very quickly, while others barely receive any.
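What I have in mind is roughly the sketch below: one KafkaIO read per consumer group, merged with Flatten. The topic split, the group IDs and the readFromKafka helper are made up for illustration and are not part of my current pipeline.

// Hypothetical helper that builds one reader with its own consumer group.
// bootstrapServers, configUpdates and serDeConfig() are the same as in my reader above.
private PTransform<PBegin, PCollection<KV<String, AvroGenericRecord>>> readFromKafka(
    List<String> topics, String groupId) {
  Map<String, Object> config = new HashMap<>(configUpdates);
  config.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
  return KafkaIO.<String, AvroGenericRecord>read()
      .withBootstrapServers(bootstrapServers)
      .withConsumerConfigUpdates(config)
      .withTopics(topics)
      .withKeyDeserializer(StringDeserializer.class)
      .withValueDeserializerAndCoder(
          BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
      .commitOffsetsInFinalize()
      .withoutMetadata();
}

// In the pipeline construction: one read per group, then Flatten.
// fastTopicList / slowTopicList and the group names are illustrative only.
PCollection<KV<String, AvroGenericRecord>> fastTopics =
    pipeline.apply("ReadFastTopics", readFromKafka(fastTopicList, "my-group-fast"));

PCollection<KV<String, AvroGenericRecord>> slowTopics =
    pipeline.apply("ReadSlowTopics", readFromKafka(slowTopicList, "my-group-slow"));

PCollection<KV<String, AvroGenericRecord>> records =
    PCollectionList.of(fastTopics).and(slowTopics)
        .apply("FlattenTopicReads", Flatten.pCollections());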
UPD
The reason I want to set up multiple consumer groups is that my job runs out of memory:
gcp#3|Caused by: java.lang.OutOfMemoryError: Java heap space
gcp#3|java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: Java heap space
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:102)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1057)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:438)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:125)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1060)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:930)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.processElements(ReduceFnRunner.java:368)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:94)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1316)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1049)
gcp#3| java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
gcp#3| java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
gcp#3| java.lang.Thread.run(Thread.java:745)
gcp#3|Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: Java heap space
gcp#3| org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
gcp#3| org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:180)
gcp#3| org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:102)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1057)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:438)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:125)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1060)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:930)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.processElements(ReduceFnRunner.java:368)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:94)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
gcp#3| org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
gcp#3| org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
gcp#3| org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1316)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
gcp#3| org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1049)
gcp#3| java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
gcp#3| java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
gcp#3| java.lang.Thread.run(Thread.java:745)
gcp#3|Caused by: java.lang.OutOfMemoryError: Java heap space
As far as I can tell, the problem is not reading from Kafka; I believe it is caused by incorrect windowing. I have a lot of topics (more than 40), I try to read all of them, and there is a large amount of data... I tried event-time windowing to handle all of it.
Here is my windowing:
records.apply(Window.<AvroGenericRecord>into(FixedWindows.of(Duration.standardHours(options.getWindowInMinutes())))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
        .withLateFirings(AfterPane.elementCountAtLeast(options.getElementsCountToWaitAfterWatermark())))
    .withAllowedLateness(Duration.standardHours(1))
    .discardingFiredPanes());
UPD 2.0
I think it happens during writing.
Here is my class that writes the Avro data into a GCS bucket. It is supposed to lay the data out by topic name and timestamp; the final output should be bucket/{topic}/{date}/{'avroContainerPerWindowOrPane'}.
This is how I do it:
public class DynamicAvroGenericRecordDestinations extends DynamicAvroDestinations<AvroGenericRecord, AvroDestination, GenericRecord> {

  private static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");

  private final String baseDir;
  private final String fileExtension;

  public DynamicAvroGenericRecordDestinations(String baseDir, String fileExtension) {
    this.baseDir = baseDir;
    this.fileExtension = fileExtension;
  }

  @Override
  public Schema getSchema(AvroDestination destination) {
    return new Schema.Parser().parse(destination.jsonSchema);
  }

  @Override
  public GenericRecord formatRecord(AvroGenericRecord record) {
    return record.getRecord();
  }

  @Override
  public AvroDestination getDestination(AvroGenericRecord record) {
    Schema schema = record.getRecord().getSchema();
    return AvroDestination.of(record.getName(), record.getDate(), record.getVersionId(), schema.toString());
  }

  @Override
  public AvroDestination getDefaultDestination() {
    return new AvroDestination();
  }

  @Override
  public FileBasedSink.FilenamePolicy getFilenamePolicy(AvroDestination destination) {
    String pathStr = baseDir + "/" + destination.name + "/" + destination.date + "/" + destination.name;
    return new WindowedFilenamePolicy(FileBasedSink.convertToFileResourceIfPossible(pathStr), destination.version, fileExtension);
  }

  private static class WindowedFilenamePolicy extends FileBasedSink.FilenamePolicy {

    final ResourceId outputFilePrefix;
    final String fileExtension;
    final Integer version;

    WindowedFilenamePolicy(ResourceId outputFilePrefix, Integer version, String fileExtension) {
      this.outputFilePrefix = outputFilePrefix;
      this.version = version;
      this.fileExtension = fileExtension;
    }

    @Override
    public ResourceId windowedFilename(
        int shardNumber,
        int numShards,
        BoundedWindow window,
        PaneInfo paneInfo,
        FileBasedSink.OutputFileHints outputFileHints) {
      IntervalWindow intervalWindow = (IntervalWindow) window;

      String filenamePrefix =
          outputFilePrefix.isDirectory() ? "" : firstNonNull(outputFilePrefix.getFilename(), "");

      String filename =
          String.format("%s-%s(%s-%s)-(%s-of-%s)%s", filenamePrefix,
              version,
              formatter.print(intervalWindow.start()),
              formatter.print(intervalWindow.end()),
              shardNumber,
              numShards - 1,
              fileExtension);

      ResourceId result = outputFilePrefix.getCurrentDirectory();
      return result.resolve(filename, RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(
        int shardNumber, int numShards, FileBasedSink.OutputFileHints outputFileHints) {
      throw new UnsupportedOperationException("Expecting windowed outputs only");
    }

    @Override
    public void populateDisplayData(DisplayData.Builder builder) {
      builder.add(
          DisplayData.item("fileNamePrefix", outputFilePrefix.toString())
              .withLabel("File Name Prefix"));
    }
  }
}
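For context, this is roughly how the destinations class is plugged into the write step. The snippet below is a simplified sketch: the windowedRecords variable, the option getters (getOutputBaseDir, getNumShards) and the temp directory are placeholders, not my exact configuration.

// Simplified sketch of the write step; paths, shard count and option names are placeholders.
windowedRecords.apply("WriteAvroToGcs",
    AvroIO.<AvroGenericRecord>writeCustomTypeToGenericRecords()
        .to(new DynamicAvroGenericRecordDestinations(options.getOutputBaseDir(), ".avro"))
        .withTempDirectory(FileSystems.matchNewResource(options.getTempLocation(), true))
        .withWindowedWrites()
        .withNumShards(options.getNumShards()));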
Answer 0 (score: 0)
I don't think KafkaIO allows having different GROUP_IDs within the same Kafka read transform. Well, we do allow two different consumer configs, but that's because under the hood there are actually two consumers in KafkaIO, one for messages and one for offsets, so that's a different story.
By the way, what problem does it cause in your case that messages are consumed from the topics at different rates?