Dataflow DoFn won't serialize with PipelineOptions

Time: 2017-07-17 19:03:27

Tags: google-cloud-platform google-cloud-dataflow

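I am trying to pass a PipelineOptions interface to a Dataflow DoFn so that the DoFn can configure some non-serializable things it needs to re-instantiate. However, when I have the DoFn hold an instance of my PipelineOptions subclass, Dataflow seems unable to serialize the DoFn. Do I need to do something to the Options interface to make it serialize correctly?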

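I know that writing custom serialization and deserialization code is an option (as in https://gist.github.com/jlewi/f1cd323dc88bd58601ef and "How to fix Dataflow unable to serialize my DoFn?"), but the PipelineOptions class seems to state explicitly that it should be serializable, and I would prefer not to write serialization and deserialization code in every DoFn that uses this options object.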

Options class snippet:

public interface Options 
extends BigtableOptions, BigtableScanOptions, OfflineModuleOptions, Serializable {...}

DoFn definition:

public class RunEventGeneratorsDoFn extends DoFn<...,...> {
    private OfflinePipelineRunner.Options options;
....
}

Serialization exception when options is not marked transient:
Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize [my DoFn]
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:54)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.clone(SerializableUtils.java:91)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$Bound.<init>(ParDo.java:720)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$Unbound.of(ParDo.java:678)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$Unbound.access$000(ParDo.java:596)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.of(ParDo.java:563)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.of(ParDo.java:558)
    at [dofn instantiation line]
Caused by: java.io.NotSerializableException: com.google.cloud.dataflow.sdk.options.ProxyInvocationHandler
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:50)
    ... 7 more

1 answer:

Answer 0 (score: 2)

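The actual pipeline options object should not be held in a field of a particular DoFn or PTransform. Instead, pass in the values of the specific options you want to access.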
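To illustrate the difference, here is a minimal sketch that uses only the JDK, not the Dataflow SDK. Everything in it is a hypothetical stand-in: `FakeOptions` plays the role of the Options interface, a `java.lang.reflect.Proxy` with a non-serializable handler plays the role of the proxy behind PipelineOptions (the `ProxyInvocationHandler` in the stack trace), and `HoldsOptions` / `HoldsValue` stand in for the DoFn. It assumes the failure comes from the proxy's handler being written out along with the field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class OptionValueSketch {

    // Hypothetical stand-in for a PipelineOptions sub-interface.
    public interface FakeOptions extends Serializable {
        String getTableId();
    }

    // Builds a proxy-backed options object. The handler below does not
    // implement Serializable, which is the situation this sketch assumes.
    public static FakeOptions fakeOptions(String tableId) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getName().equals("getTableId")) {
                return tableId;
            }
            throw new UnsupportedOperationException(method.getName());
        };
        return (FakeOptions) Proxy.newProxyInstance(
            FakeOptions.class.getClassLoader(),
            new Class<?>[] {FakeOptions.class},
            handler);
    }

    // Shape from the question: the whole options object is a field.
    public static class HoldsOptions implements Serializable {
        private final FakeOptions options;

        public HoldsOptions(FakeOptions options) {
            this.options = options;
        }
    }

    // Shape suggested by the answer: only the needed value is a field.
    public static class HoldsValue implements Serializable {
        private final String tableId;

        public HoldsValue(String tableId) {
            this.tableId = tableId;
        }

        public String tableId() {
            return tableId;
        }
    }

    // Returns true if Java serialization of the object succeeds.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        FakeOptions options = fakeOptions("events");
        System.out.println(canSerialize(new HoldsOptions(options)));
        System.out.println(canSerialize(new HoldsValue(options.getTableId())));
    }
}
```

In a real pipeline the equivalent move would be to read the needed values from the options object while constructing the pipeline and hand them to the DoFn's constructor, then rebuild any non-serializable helpers inside the DoFn from those values.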

For more background, see the question "How to get PipelineOptions in composite PTransform in Beam 2.0?".