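I am using Google Cloud Dataflow to run some transformations.
I read about 3 million records from BigQuery (GBQ), transform them, and write the transformed results to GCS.
While doing this, the Dataflow job fails with the error: Shutting down JVM after 8 consecutive periods of measured GC thrashing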
Workflow failed. Causes: S20:Read GBQ/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey/Read+Read GBQ/Reshuffle.Values/Map+Read GBQ/ReadFiles+Read GBQ/PassThroughThenCleanup/ParMultiDo(Identity)+Read GBQ/PassThroughThenCleanup/View.AsIterable/ParDo(ToIsmRecordForGlobalWindow)+transform+Split results/ParMultiDo(Partition)+Write errors/WriteFiles/RewindowIntoGlobal/Window.Assign+Write errors/WriteFiles/WriteShardedBundlesToTempFiles/ApplyShardingKey+Write errors/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards/Reify+Write errors/WriteFiles/WriteShardedBundlesToTempFiles/[...]/WriteShardedBundlesToTempFiles/GroupIntoShards/Reify+Write entities Gzip/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards/Write failed. The work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
DataConverterOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
    .as(DataConverterOptions.class);
Pipeline p = Pipeline.create(options);

EntityCreatorFn entityCreatorFn = EntityCreatorFn.newWithGCSMapping(options.getMapping(),
    options.getWithUri(), options.getLineNumberToResult(), options.getIsPartialUpdate(),
    options.getQuery() != null);

// Read each BigQuery row as a "lineNumber|sourceData" string, transform it,
// then split the output into two partitions (successes and errors).
PCollectionList<String> resultByType =
    p.apply("Read GBQ", BigQueryIO.read(
            (SchemaAndRecord elem) -> elem.getRecord().get("lineNumber") + "|" + elem.getRecord().get("sourceData"))
        .fromQuery(options.getQuery()).withoutValidation()
        .withCoder(StringUtf8Coder.of()).withTemplateCompatibility())
    .apply("transform", ParDo.of(entityCreatorFn))
    .apply("Split results", Partition.of(2, (Partition.PartitionFn<String>) (elem, numPartitions) -> {
        if (elem.startsWith(PREFIX_ERROR)) {
            return PARTITION_ERROR;
        }
        return PARTITION_SUCCESS;
    }));

// Partition 0 is written as gzip-compressed entities, partition 1 as a single error file.
FileIO.Sink<String> sink = TextIO.sink();
resultByType.get(0).apply("Write entities Gzip",
    FileIO.<String>write().to(options.getOutput()).withCompression(Compression.GZIP)
        .withNumShards(options.getShards()).via(sink));
resultByType.get(1).apply("Write errors",
    TextIO.write().to(options.getErrorOutput()).withoutSharding());

p.run();
Shutting down JVM after 8 consecutive periods of measured GC thrashing. Memory is used/total/max = 109/301/2507 MB, GC last/max = 54.00/54.00 %, #pushbacks=0, gc thrashing=true.
Answer 0 (score: 0)
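Does 'EntityCreatorFn.newWithGCSMapping' cache elements in memory? It looks like one of the steps in your pipeline is consuming too much memory (note that Dataflow cannot parallelize the processing of a single element within a DoFn). I suggest tuning your pipeline or trying highmem machines. If the problem persists, consider contacting Google Cloud Support and providing the relevant job IDs, etc.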
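To make the "caching elements in memory" question concrete, here is a minimal plain-Java sketch of the pattern to look for. It is not the asker's EntityCreatorFn and uses no Beam classes; BoundedBuffer, maxBuffered, and the downstream callback are hypothetical names chosen for illustration. The idea is that a per-element processor which has to buffer should cap the buffer and hand results downstream as soon as the cap is reached, so heap use stays roughly constant instead of growing with the 3 million input records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration: a buffer that never holds more than maxBuffered elements. */
final class BoundedBuffer {
    private final int maxBuffered;
    private final Consumer<List<String>> downstream;
    private final List<String> buffer = new ArrayList<>();
    private int peakBuffered = 0;

    BoundedBuffer(int maxBuffered, Consumer<List<String>> downstream) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be at least 1");
        }
        this.maxBuffered = maxBuffered;
        this.downstream = downstream;
    }

    /** Buffers one element and flushes as soon as the cap is reached. */
    void process(String element) {
        buffer.add(element);
        peakBuffered = Math.max(peakBuffered, buffer.size());
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    /** Hands a copy of the buffered elements downstream and releases them. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Largest number of elements held at once; stays at or below maxBuffered. */
    int peakBuffered() {
        return peakBuffered;
    }
}

final class BoundedBufferDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BoundedBuffer buffer = new BoundedBuffer(1000, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 2500; i++) {
            buffer.process(i + "|payload");
        }
        buffer.flush(); // emit the final partial batch
        System.out.println("batches=" + batchSizes + " peak=" + buffer.peakBuffered());
    }
}
```

If the real DoFn keeps an unbounded list or map across elements instead, that would be consistent with the GC thrashing reported above, though only a heap dump or code review can confirm it. The other suggestion, highmem machines, is a launch option rather than a code change: with the Beam Java SDK's Dataflow runner it is the --workerMachineType pipeline option (for example n1-highmem-4, picked here only as an illustration).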