Dataflow DynamicDestinations: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite

Asked: 2017-09-11 23:54:37

Tags: google-cloud-dataflow apache-beam

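I am trying to use DynamicDestinations to write to a partitioned table in BigQuery, where the partition name is mytable$yyyyMMdd. If I bypass DynamicDestinations and pass a hard-coded table name to .to(), it works; with DynamicDestinations, however, I get the following exception: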

java.lang.IllegalArgumentException: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite$1@6fff253c
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
at org.apache.beam.sdk.util.SerializableUtils.clone(SerializableUtils.java:90)
at org.apache.beam.sdk.transforms.ParDo$SingleOutput.<init>(ParDo.java:591)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:435)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:51)
at org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.expand(PrepareWrite.java:36)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:297)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:987)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:972)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:659)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:284)
at com.homedepot.payments.monitoring.eventprocessor.MetricsAggregator.main(MetricsAggregator.java:82)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.model.TableReference
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)

Here is the code:

PCollection<Event> rawEvents = pipeline
    .apply("ReadFromPubSub",
        PubsubIO.readProtos(EventOuterClass.Event.class)
                .fromSubscription(OPTIONS.getSubscription())
    )
    .apply("Parse", ParDo.of(new ParseFn()))
    .apply("ExtractAttributes", ParDo.of(new ExtractAttributesFn()));


EventTable table = new EventTable(OPTIONS.getProjectId(), OPTIONS.getMetricsDatasetId(), OPTIONS.getRawEventsTable());
rawEvents.apply(BigQueryIO.<Event>write()
    .to(new DynamicDestinations<Event, String>() {

        private static final long serialVersionUID = 1L;

        @Override
        public TableSchema getSchema(String destination) {
            return table.schema();
        }

        @Override
        public TableDestination getTable(String destination) {
            return new TableDestination(table.reference(), null);
        }

        @Override
        public String getDestination(ValueInSingleWindow<Event> element) {
            // Format the element's event timestamp as a UTC day, e.g. 20170911.
            String dayString = DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC).print(element.getTimestamp());
            return table.reference().getTableId() + "$" + dayString;
        }
    })
    .withFormatFunction(new SerializableFunction<Event, TableRow>() {
        public TableRow apply(Event event) {
            TableRow row = new TableRow();
            Event evnt = (Event) event;
            row.set(EventTable.Field.VERSION.getName(), evnt.getVersion());
            row.set(EventTable.Field.TIMESTAMP.getName(), evnt.getTimestamp() / 1000);
            row.set(EventTable.Field.EVENT_TYPE_ID.getName(), evnt.getEventTypeId());
            row.set(EventTable.Field.EVENT_ID.getName(), evnt.getId());
            row.set(EventTable.Field.LOCATION.getName(), evnt.getLocation());
            row.set(EventTable.Field.SERVICE.getName(), evnt.getService());
            row.set(EventTable.Field.HOST.getName(), evnt.getHost());
            row.set(EventTable.Field.BODY.getName(), evnt.getBody());
            return row;
        }
    })
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);

Any pointers in the right direction would be greatly appreciated. Thanks!

2 answers:

Answer 0 (score: 2)

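Judging from the exception message and the code above, the EventTable object that your anonymous DynamicDestinations class captures appears to contain a non-serializable TableReference field. Because the anonymous class refers to the local variable table, that object is stored in the anonymous instance and gets serialized along with it.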

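One workaround is to convert the anonymous DynamicDestinations into a static inner class and define a constructor that stores only the serializable parts of EventTable needed to implement the interface.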

For example:

private static class EventDestinations extends DynamicDestinations<Event, String> {
  private final TableSchema schema;
  private final TableDestination destination;
  private final String tableId;

  private EventDestinations(EventTable table) {
    this.schema = table.schema();
    this.destination = new TableDestination(table.reference(), null);
    this.tableId = table.reference().getTableId();
  }

  // ..
}
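The capture problem and the workaround can be reproduced without Beam. The following is only a minimal sketch using plain Java serialization: TableRef, EventTableStandIn, Destinations, and TableIdOnly are hypothetical stand-ins for TableReference, EventTable, DynamicDestinations, and the static class above, not the real Beam or BigQuery types. It suggests why an anonymous subclass that touches table fails to serialize while a static nested class that copies out a String does not. As an aside, TableSchema comes from the same generated API model package as TableReference, so it may be worth checking that it serializes before keeping it as a field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {

    // Hypothetical stand-in for TableReference: deliberately NOT Serializable.
    static class TableRef {
        final String tableId;
        TableRef(String tableId) { this.tableId = tableId; }
    }

    // Hypothetical stand-in for EventTable: Serializable itself, but it holds
    // a non-serializable field, so writing it out still fails.
    static class EventTableStandIn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final TableRef ref;
        EventTableStandIn(String tableId) { this.ref = new TableRef(tableId); }
        TableRef reference() { return ref; }
    }

    // Hypothetical stand-in for DynamicDestinations: a Serializable base class.
    abstract static class Destinations implements Serializable {
        private static final long serialVersionUID = 1L;
        abstract String tableId();
    }

    // Anonymous subclass: using `table` in the body makes the compiler keep it
    // in a hidden field, so the whole EventTableStandIn is serialized too.
    static Destinations anonymous(final EventTableStandIn table) {
        return new Destinations() {
            private static final long serialVersionUID = 1L;
            @Override String tableId() { return table.reference().tableId; }
        };
    }

    // Static nested subclass: copies out only the String it needs.
    static class TableIdOnly extends Destinations {
        private static final long serialVersionUID = 1L;
        private final String tableId;
        TableIdOnly(EventTableStandIn table) { this.tableId = table.reference().tableId; }
        @Override String tableId() { return tableId; }
    }

    // Returns true if Java serialization accepts the object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EventTableStandIn table = new EventTableStandIn("raw_events");
        System.out.println("anonymous:     " + canSerialize(anonymous(table)));
        System.out.println("static nested: " + canSerialize(new TableIdOnly(table)));
    }
}
```

In this sketch the first check fails with a NotSerializableException naming the stand-in reference class, mirroring the "Caused by" line in the question, and the second check succeeds.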

Answer 1 (score: 0)

It looks like you are trying to populate a specific partition based on the event. Why not use:

SerializableFunction<ValueInSingleWindow<Event>, TableDestination>
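A function of this shape only needs to turn an element's timestamp into a table name with a partition suffix. The sketch below shows that mapping in isolation, under stated assumptions: it uses java.time rather than the Joda-Time classes from the question, and PartitionSketch, SerFn, partitionedTableId, and toPartition are hypothetical names for illustration, not Beam APIs. In a real pipeline the resulting string would still have to be wrapped in a TableDestination.

```java
import java.io.Serializable;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.function.Function;

class PartitionSketch {

    // Hypothetical interface: a Function that is also Serializable, similar
    // in spirit to Beam's SerializableFunction.
    interface SerFn<I, O> extends Function<I, O>, Serializable {}

    // Formats an instant as a UTC day in yyyyMMdd form.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Builds "<tableId>$<yyyyMMdd>" from the table id and an event timestamp.
    static String partitionedTableId(String tableId, Instant timestamp) {
        return tableId + "$" + DAY.format(timestamp);
    }

    // The lambda captures only a String, so it stays serializable.
    static SerFn<Instant, String> toPartition(final String tableId) {
        return timestamp -> partitionedTableId(tableId, timestamp);
    }

    public static void main(String[] args) {
        SerFn<Instant, String> fn = toPartition("mytable");
        System.out.println(fn.apply(Instant.parse("2017-09-11T23:54:37Z")));
    }
}
```

Because the function captures only the table id string, it avoids dragging a TableReference or the whole EventTable into the serialized closure.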