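We are converting one of our Pig pipelines to Flink using Apache Beam. The Pig pipeline reads two different datasets (R1 and R2) from HDFS, enriches them, joins them, and writes the result back to HDFS. Dataset R1 is skewed: a small number of keys have a very large number of records. When we converted the Pig pipeline to Apache Beam and ran it with Flink on our production YARN cluster, we got the following error: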
2018-11-21 16:52:25,307 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: GroupReduce (GroupReduce at CoGBK/GBK) (25/100)
java.lang.RuntimeException: Emitting the record caused an I/O exception: Failed to serialize element. Serialized size (> 1136656562 bytes) exceeds JVM heap space
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:69)
at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
at org.apache.beam.runners.flink.translation.functions.SortingFlinkCombineRunner.combine(SortingFlinkCombineRunner.java:140)
at org.apache.beam.runners.flink.translation.functions.FlinkReduceFunction.reduce(FlinkReduceFunction.java:85)
at org.apache.flink.api.java.operators.translation.PlanUnwrappingReduceGroupOperator$TupleUnwrappingNonCombinableGroupReducer.reduce(PlanUnwrappingReduceGroupOperator.java:111)
at org.apache.flink.runtime.operators.GroupReduceDriver.run(GroupReduceDriver.java:131)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to serialize element. Serialized size (> 1136656562 bytes) exceeds JVM heap space
at org.apache.flink.core.memory.DataOutputSerializer.resize(DataOutputSerializer.java:323)
at org.apache.flink.core.memory.DataOutputSerializer.write(DataOutputSerializer.java:149)
at org.apache.beam.runners.flink.translation.wrappers.DataOutputViewWrapper.write(DataOutputViewWrapper.java:48)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1286)
at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1577)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:351)
at org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:170)
at org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:50)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:71)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:58)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:32)
at org.apache.beam.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:98)
at org.apache.beam.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:60)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:71)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:36)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:520)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:480)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.serialize(CoderTypeSerializer.java:83)
at org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:54)
at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:88)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:131)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
... 9 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.flink.core.memory.DataOutputSerializer.resize(DataOutputSerializer.java:305)
at org.apache.flink.core.memory.DataOutputSerializer.write(DataOutputSerializer.java:149)
at org.apache.beam.runners.flink.translation.wrappers.DataOutputViewWrapper.write(DataOutputViewWrapper.java:48)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1286)
at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1577)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:351)
at org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:170)
at org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:50)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:71)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:58)
at org.apache.beam.sdk.transforms.join.UnionCoder.encode(UnionCoder.java:32)
at org.apache.beam.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:98)
at org.apache.beam.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:60)
at org.apache.beam.sdk.coders.Coder.encode(Coder.java:136)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:71)
at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:36)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:520)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:480)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.serialize(CoderTypeSerializer.java:83)
at org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:54)
at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:88)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:131)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
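From the exceptions view of the Flink JobManager dashboard, we can see that this happens in the join operation.
When I say the R1 dataset is skewed, I mean that a few keys occur as many as 8,000,000 times, while most keys occur only once.
In dataset R2, each key occurs at most once.
Also, if we exclude the keys with such high occurrence counts, the pipeline runs fine, which shows that the failure is caused only by these few keys.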
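To make the skew concrete, here is a minimal, scaled-down sketch in plain Java. It is not our actual Beam code: the class `SkewSketch`, its methods, and the sample data are made up for illustration. It assumes the join groups all values for a key into one collection before emitting it, which is what the `IterableLikeCoder` and `KvCoder` frames in the stack trace seem to suggest.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkewSketch {
    // Groups R1 values by key, collecting every value for a key into one list.
    // Assumption: this mirrors how a group-by-key style join builds one
    // (key, iterable-of-values) record per key.
    static Map<String, List<String>> groupByKey(List<String[]> r1) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (String[] kv : r1) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Returns the number of values held by the largest group (the hottest key).
    static int largestGroup(Map<String, List<String>> grouped) {
        int max = 0;
        for (List<String> values : grouped.values()) {
            max = Math.max(max, values.size());
        }
        return max;
    }

    public static void main(String[] args) {
        List<String[]> r1 = new ArrayList<>();
        // Hypothetical scaled-down data: one hot key with 8,000 records
        // (standing in for 8,000,000) plus 1,000 keys with one record each.
        for (int i = 0; i < 8_000; i++) {
            r1.add(new String[] {"hot", "v" + i});
        }
        for (int i = 0; i < 1_000; i++) {
            r1.add(new String[] {"k" + i, "v"});
        }
        Map<String, List<String>> grouped = groupByKey(r1);
        System.out.println("keys=" + grouped.size()
            + " largest=" + largestGroup(grouped));
    }
}
```

If the grouped record for the hot key is serialized as a single element, its size grows with the number of values for that key, which would be consistent with the roughly 1.1 GB serialized size reported in the error.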
Hadoop version: 2.7.1, Beam version: 2.8.0, Flink Runner version: 2.8.0
Let me know what other information I should collect and post here so that you can help me resolve this issue.