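How to maintain a connection to an external database in Cloud Dataflow

Asked: 2015-10-18 15:16:10

Tags: cassandra google-cloud-dataflow

I have an unbounded Dataflow pipeline that reads from Pub/Sub, applies a ParDo, and writes to Cassandra. Because it applies only ParDo transforms, I use the default global window with default triggering even though the source is unbounded.

How should I maintain the connection to Cassandra in a pipeline like this?

Currently I create it in startBundle: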

private class CassandraWriter<T> extends DoFn<T, Void> {
  // hosts, port and keyspace are configuration fields of this class (declarations omitted here).
  private transient Cluster cluster;
  private transient Session session;
  private transient MappingManager mappingManager;

  @Override
  public void startBundle(Context c) {
    this.cluster = Cluster.builder()
        .addContactPoints(hosts)
        .withPort(port)
        .withoutMetrics()
        .withoutJMXReporting()
        .build();
    this.session = cluster.connect(keyspace);
    this.mappingManager = new MappingManager(session);
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }

  @Override
  public void finishBundle(Context c) throws IOException {
    session.close();
    cluster.close();
  }
}

However, this way a new connection is created for every element.

Another option is to pass it as a side input, as in https://github.com/benjumanji/cassandra-dataflow:

public PDone apply(PCollection<T> input) {
  Pipeline p = input.getPipeline();

  CassandraWriteOperation<T> op = new CassandraWriteOperation<T>(this);

  Coder<CassandraWriteOperation<T>> coder =
    (Coder<CassandraWriteOperation<T>>)SerializableCoder.of(op.getClass());

  PCollection<CassandraWriteOperation<T>> opSingleton =
    p.apply(Create.<CassandraWriteOperation<T>>of(op)).setCoder(coder);

  final PCollectionView<CassandraWriteOperation<T>> opSingletonView =
    opSingleton.apply(View.<CassandraWriteOperation<T>>asSingleton());

  PCollection<Void> results = input.apply(ParDo.of(new DoFn<T, Void>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
       // use the side input here
    }
  }).withSideInputs(opSingletonView));

  PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable());

  opSingleton.apply(ParDo.of(new DoFn<CassandraWriteOperation<T>, Void>() {
    private static final long serialVersionUID = 0;

    @Override
    public void processElement(ProcessContext c) {
      CassandraWriteOperation<T> op = c.element();
      op.finalize();
    }

  }).withSideInputs(voidView));

  return new PDone();
}

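However, this way I have to use windowing, because PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable()); applies a group-by.

In general, how should a PTransform that writes from an unbounded PCollection to an external database maintain its connection to the database?

2 answers:

Answer 0 (score: 2)

You are correct that typical bundle sizes are smaller in the streaming/unbounded case than in the batch/bounded case. The actual bundle size depends on many parameters, and a bundle may sometimes contain a single element.

One way to solve this is to use a per-worker connection pool, stored in static state of the DoFn. You should be able to initialize it on the first call to startBundle and reuse it across bundles. Alternatively, you can create connections on demand and release them to the pool for reuse when they are no longer needed.

You should make sure the static state is thread-safe and that you make no assumptions about how Dataflow manages bundles.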
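
The second option above (create a connection on demand, release it to a pool for reuse) can be sketched without any Dataflow or Cassandra dependency. This is only a minimal sketch under stated assumptions: `WorkerPool`, `borrow`, `release` and `createdCount` are hypothetical names that belong to neither library, and the pool is assumed to be held in a static field so that it is shared by everything running in one worker JVM.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: not part of the Dataflow SDK or the Cassandra driver.
final class WorkerPool<C> {
  // Thread-safe queue of connections that are currently not in use.
  private final BlockingQueue<C> idle = new LinkedBlockingQueue<>();
  private final Supplier<C> factory;
  private final AtomicInteger created = new AtomicInteger();

  WorkerPool(Supplier<C> factory) {
    this.factory = factory;
  }

  // Reuse an idle connection if there is one, otherwise create one on demand.
  C borrow() {
    C conn = idle.poll();
    if (conn == null) {
      created.incrementAndGet();
      conn = factory.get();
    }
    return conn;
  }

  // Hand the connection back so that a later bundle can reuse it.
  void release(C conn) {
    idle.offer(conn);
  }

  // Number of connections the factory has been asked to create so far.
  int createdCount() {
    return created.get();
  }
}
```

With a helper like this, a DoFn would presumably call `borrow()` in startBundle and `release(...)` in finishBundle instead of closing the connection there, so a bundle of one element no longer pays for a new connection. The queue makes the pool itself thread-safe; the pooled objects still have to tolerate being handed from one thread to another.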

Answer 1 (score: 1)

As Davor Bonaci suggested, using static variables solved the problem.

public class CassandraWriter<T> extends DoFn<T, Void> {
  private static final Logger log = LoggerFactory.getLogger(CassandraWriter.class);

  // Static fields are shared by every instance of this DoFn in the worker JVM.
  // The lock prevents multiple threads from creating multiple cluster connections in parallel.
  // (static fields are never serialized, so they do not need to be marked transient.)
  private static final Object lock = new Object();
  private static Cluster cluster;
  private static Session session;
  private static MappingManager mappingManager;

  private final String[] hosts;
  private final int port;
  private final String keyspace;

  public CassandraWriter(String[] hosts, int port, String keyspace) {
    this.hosts = hosts;
    this.port = port;
    this.keyspace = keyspace;
  }

  @Override
  public void startBundle(Context c) {
    synchronized (lock) {
      if (cluster == null) {
        cluster = Cluster.builder()
            .addContactPoints(hosts)
            .withPort(port)
            .withoutMetrics()
            .withoutJMXReporting()
            .build();
        session = cluster.connect(keyspace);
        mappingManager = new MappingManager(session);
      }
    }
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }
}
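
The once-only initialization that the synchronized block in startBundle performs can also be isolated in a small helper, which makes the behaviour easy to check outside Dataflow. The following is a sketch, not the answer's actual code: `SharedResource`, `get` and `createdCount` are hypothetical names, and the helper is assumed to be stored in a static field of the DoFn.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: not part of the Dataflow SDK or the Cassandra driver.
final class SharedResource<R> {
  private final Object lock = new Object();
  private final Supplier<R> factory;
  private final AtomicInteger created = new AtomicInteger();
  private R instance; // guarded by lock

  SharedResource(Supplier<R> factory) {
    this.factory = factory;
  }

  // The first caller creates the resource; every later caller, on any thread,
  // gets the same instance back.
  R get() {
    synchronized (lock) {
      if (instance == null) {
        instance = factory.get();
        created.incrementAndGet();
      }
      return instance;
    }
  }

  // Number of times the factory has actually been invoked.
  int createdCount() {
    return created.get();
  }
}
```

Calling `get()` from many threads at once should invoke the factory exactly once. As in the class above, nothing here ever closes the resource; it simply lives as long as the worker JVM does.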