I have a Spark Structured Streaming job that reads UI events from several busy Kafka topics. The current flow is shown below.
The problem: after running for 10-12 hours the job throws a "too many db connection open"
error, and it comes only from step 2.
The job runs against an Aerospike database. Is there a way to optimize this flow, and in particular to reduce the number of calls to the database?
1. Read the data:
sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServersString)
    .option("subscribe", newTopic)
    .option("startingOffsets", "latest")
    .option("enable.auto.commit", false)
    .option("failOnDataLoss", false)
    .load();
2. Map values from the database and aggregate the data:
dataset
    .map(
        new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) throws Exception {
                // one synchronous DB lookup per input row;
                // the other slots of objects are filled from the incoming row (elided here)
                Object[] objects = new Object[eventSpecificStructType.size()];
                objects[1] = aerospikeDao.getSomeValueFromCode(row.getAs("code"));
                return new GenericRowWithSchema(objects, eventSpecificStructType);
            }
        },
        RowEncoder.apply(eventSpecificStructType)
    )
    .withWatermark("timestamp", "30 seconds")
    .select(
        col("timestamp"),
        col("platform"),
        col("some_value")
    )
    .groupBy(
        functions.window(col("timestamp"), "30 seconds"),
        col("platform"),
        col("some_value")
    )
    .agg(
        count(lit(1)).as("count")
    );
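One direction I am considering for step 2 is memoizing the lookups, since busy UI topics repeat the same `code` many times. Below is a minimal sketch with no Spark or Aerospike dependency; `CodeCache` and its `lookup` function are hypothetical stand-ins for `aerospikeDao.getSomeValueFromCode`. In the job it would be created once per partition (e.g. inside `mapPartitions`) rather than once per row:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Per-task cache: each distinct code hits the database once per partition
// instead of once per row. "lookup" stands in for the DAO call.
public final class CodeCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;

    public CodeCache(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    public String valueFor(String code) {
        // computeIfAbsent calls the DB only on a cache miss
        return cache.computeIfAbsent(code, lookup);
    }
}
```

With thousands of events per window but only a handful of distinct codes, this alone would shrink the number of DB round trips dramatically.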
3. Write to the database:
aggregatedDataset
    .writeStream()
    .outputMode(OutputMode.Append())
    .foreach(sink)
    .trigger(Trigger.ProcessingTime("30 seconds"))
    .start();
DAO:
public AerospikeClient connect() {
    if (aerospikeClient == null || !aerospikeClient.isConnected()) {
        setAerospikeClient();
    }
    return this.aerospikeClient;
}

public void close() {
    if (aerospikeClient != null && aerospikeClient.isConnected()) {
        aerospikeClient.close();
    }
}

public String getSomeValueFromCode(String code) {
    connect();
    // namespace and setName are fields on this DAO
    Key key = new Key(namespace, setName, code);
    Record record = aerospikeClient.get(null, key, "SomeValue"); // null = default read policy
    close();
    return record == null ? null : record.getString("SomeValue");
}
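For reference, the `connect()`/`close()` pair in the DAO runs once per lookup, so every row can open a fresh connection. An alternative I am weighing is one shared, lazily created client per executor JVM, closed via a shutdown hook (the real `AerospikeClient` is documented as thread-safe, so a single instance can serve all tasks). A sketch, using `AutoCloseable` as a stand-in for the Aerospike client:

```java
import java.util.function.Supplier;

// Sketch: one client per JVM instead of connect()/close() per record.
// AutoCloseable stands in for com.aerospike.client.AerospikeClient here.
public final class SharedClient {
    private static volatile AutoCloseable instance;

    private SharedClient() {}

    public static AutoCloseable get(Supplier<AutoCloseable> factory) {
        if (instance == null) {                   // fast path, no lock
            synchronized (SharedClient.class) {
                if (instance == null) {           // double-checked locking
                    AutoCloseable created = factory.get();
                    // close exactly once, when the executor JVM exits
                    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                        try { created.close(); } catch (Exception ignored) { }
                    }));
                    instance = created;
                }
            }
        }
        return instance;
    }
}
```

Each executor would then hold one open connection pool for its whole lifetime instead of churning connections per row, which is what I suspect is behind the "too many db connection open" error.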