I have an unbounded streaming pipeline that reads from Pub/Sub, applies a ParDo, and writes to Cassandra. Since it only applies ParDo transforms, I use the default global window with the default trigger, even though the source is unbounded.
How should I maintain the connection to Cassandra in such a pipeline?
Currently I set it up in startBundle:
private class CassandraWriter<T> extends DoFn<T, Void> {
  private transient Cluster cluster;
  private transient Session session;
  private transient MappingManager mappingManager;

  @Override
  public void startBundle(Context c) {
    this.cluster = Cluster.builder()
        .addContactPoints(hosts)
        .withPort(port)
        .withoutMetrics()
        .withoutJMXReporting()
        .build();
    this.session = cluster.connect(keyspace);
    this.mappingManager = new MappingManager(session);
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }

  @Override
  public void finishBundle(Context c) throws IOException {
    session.close();
    cluster.close();
  }
}
However, this way a new connection is created for every element.
Another option is to pass the connection as a side input, as in https://github.com/benjumanji/cassandra-dataflow:
public PDone apply(PCollection<T> input) {
  Pipeline p = input.getPipeline();
  CassandraWriteOperation<T> op = new CassandraWriteOperation<T>(this);

  Coder<CassandraWriteOperation<T>> coder =
      (Coder<CassandraWriteOperation<T>>) SerializableCoder.of(op.getClass());
  PCollection<CassandraWriteOperation<T>> opSingleton =
      p.apply(Create.<CassandraWriteOperation<T>>of(op)).setCoder(coder);
  final PCollectionView<CassandraWriteOperation<T>> opSingletonView =
      opSingleton.apply(View.<CassandraWriteOperation<T>>asSingleton());

  PCollection<Void> results = input.apply(ParDo.of(new DoFn<T, Void>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      // use the side input here
    }
  }).withSideInputs(opSingletonView));

  PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable());

  opSingleton.apply(ParDo.of(new DoFn<CassandraWriteOperation<T>, Void>() {
    private static final long serialVersionUID = 0;

    @Override
    public void processElement(ProcessContext c) {
      CassandraWriteOperation<T> op = c.element();
      op.finalize();
    }
  }).withSideInputs(voidView));

  return PDone.in(p);
}
However, with this approach I am forced to use windowing, because
PCollectionView<Iterable<Void>> voidView = results.apply(View.<Void>asIterable());
applies a grouping operation.
In general, how should a PTransform that writes from an unbounded PCollection to an external database maintain its connection to that database?
Answer 0 (score: 2)
You have correctly observed that typical bundle sizes in the streaming/unbounded case are smaller than in the batch/bounded case. The actual bundle size depends on many parameters, and sometimes a bundle may contain a single element.

One way to solve this is to use a per-worker pool of connections, stored in static state of the DoFn. You should be able to initialize it during the first call to startBundle and then use it across bundles. Alternatively, you could create connections on demand and release them back into the pool for reuse when they are no longer needed.

Make sure that the static state is thread-safe, and that you make no assumptions about how Dataflow manages bundles.
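A minimal, self-contained sketch of this per-worker pool pattern is below. It uses a hypothetical FakeConnection in place of the real Cassandra Session (the driver types are not needed to show the idea): connections are borrowed on demand and released back into a static, thread-safe pool so later bundles reuse them instead of reconnecting.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionPoolSketch {
  // Hypothetical stand-in for a real driver connection; counts constructions.
  static class FakeConnection {
    private static final AtomicInteger created = new AtomicInteger();
    FakeConnection() { created.incrementAndGet(); }
    static int totalCreated() { return created.get(); }
  }

  // Static, thread-safe pool shared by all DoFn instances in this worker JVM.
  private static final ConcurrentLinkedQueue<FakeConnection> pool =
      new ConcurrentLinkedQueue<>();

  // Borrow a connection on demand, creating one only if the pool is empty.
  static FakeConnection borrow() {
    FakeConnection c = pool.poll();
    return (c != null) ? c : new FakeConnection();
  }

  // Release the connection back to the pool for reuse by later bundles.
  static void release(FakeConnection c) {
    pool.offer(c);
  }

  public static void main(String[] args) {
    // Simulate two bundles processed one after another on the same worker.
    FakeConnection c1 = borrow();
    release(c1);
    FakeConnection c2 = borrow();
    release(c2);
    // Only one physical connection was ever created.
    System.out.println("created=" + FakeConnection.totalCreated()); // prints created=1
  }
}
```

In a real DoFn, borrow() would be called in startBundle (or per element) and release() in finishBundle, with the pool living in static state exactly as described above.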
Answer 1 (score: 1)
As Davor Bonaci suggested, using static variables solves the problem.
public class CassandraWriter<T> extends DoFn<T, Void> {
  private static final Logger log = LoggerFactory.getLogger(CassandraWriter.class);

  // Prevent multiple threads from creating multiple cluster connections in parallel.
  private static final Object lock = new Object();
  private static Cluster cluster;
  private static Session session;
  private static MappingManager mappingManager;

  private final String[] hosts;
  private final int port;
  private final String keyspace;

  public CassandraWriter(String[] hosts, int port, String keyspace) {
    this.hosts = hosts;
    this.port = port;
    this.keyspace = keyspace;
  }

  @Override
  public void startBundle(Context c) {
    synchronized (lock) {
      if (cluster == null) {
        cluster = Cluster.builder()
            .addContactPoints(hosts)
            .withPort(port)
            .withoutMetrics()
            .withoutJMXReporting()
            .build();
        session = cluster.connect(keyspace);
        mappingManager = new MappingManager(session);
      }
    }
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    T element = c.element();
    Mapper<T> mapper = (Mapper<T>) mappingManager.mapper(element.getClass());
    mapper.save(element);
  }
}
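The synchronized null check is what makes this safe when Dataflow runs several DoFn instances concurrently in one JVM. A self-contained sketch (with a hypothetical init counter standing in for the expensive Cluster.builder()...build() call) shows that the initialization runs exactly once even under contention:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyInitSketch {
  private static final Object lock = new Object();
  private static Object connection;            // stands in for Cluster/Session
  static final AtomicInteger inits = new AtomicInteger();

  // Mirrors startBundle: many threads may call this concurrently, but only
  // the first thread to find connection == null performs the initialization.
  static Object get() {
    synchronized (lock) {
      if (connection == null) {
        inits.incrementAndGet();               // stands in for the expensive connect
        connection = new Object();
      }
      return connection;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulate eight worker threads hitting startBundle at the same time.
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(LazyInitSketch::get);
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("inits=" + inits.get()); // prints inits=1
  }
}
```

Note that the class above never closes the shared connection; that is also true of the CassandraWriter answer, which relies on the connection living for the lifetime of the worker JVM.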