I have a Spark Structured Streaming job that reads UI events from several busy Kafka topics. The current flow is shown below.
The problem: after running for 10-12 hours the job throws a "too many db connection open"
error, and it comes only from step 2.
The job runs against an Aerospike database. Is there a way to optimize this flow, and in particular to reduce the number of calls to the database?
1. Read the data:
sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServersString)
    .option("subscribe", newTopic)
    .option("startingOffsets", "latest")
    .option("enable.auto.commit", false)
    .option("failOnDataLoss", false)
    .load();
2. Map values from the database and aggregate the data:
dataset
    .map(
        new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) throws Exception {
                // one synchronous DB lookup per input row;
                // the other slots of objects are filled from the incoming row (elided here)
                Object[] objects = new Object[eventSpecificStructType.size()];
                objects[1] = aerospikeDao.getSomeValueFromCode(row.getAs("code"));
                return new GenericRowWithSchema(objects, eventSpecificStructType);
            }
        },
        RowEncoder.apply(eventSpecificStructType)
    )
    .withWatermark("timestamp", "30 seconds")
    .select(
        col("timestamp"),
        col("platform"),
        col("some_value")
    )
    .groupBy(
        functions.window(col("timestamp"), "30 seconds"),
        col("platform"),
        col("some_value")
    )
    .agg(
        count(lit(1)).as("count")
    );
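One direction I am considering for step 2 is memoizing the lookups, since busy UI topics repeat the same `code` many times. Below is a minimal sketch with no Spark or Aerospike dependency; `CodeCache` and its `lookup` function are hypothetical stand-ins for `aerospikeDao.getSomeValueFromCode`. In the job it would be created once per partition (e.g. inside `mapPartitions`) rather than once per row:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Per-task cache: each distinct code hits the database once per partition
// instead of once per row. "lookup" stands in for the DAO call.
public final class CodeCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;

    public CodeCache(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    public String valueFor(String code) {
        // computeIfAbsent calls the DB only on a cache miss
        return cache.computeIfAbsent(code, lookup);
    }
}
```

With thousands of events per window but only a handful of distinct codes, this alone would shrink the number of DB round trips dramatically.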
3. Write to the database:
aggregatedDataset
    .writeStream()
    .outputMode(OutputMode.Append())
    .foreach(sink)
    .trigger(Trigger.ProcessingTime("30 seconds"))
    .start();
DAO:
public AerospikeClient connect() {
    if (aerospikeClient == null || !aerospikeClient.isConnected()) {
        setAerospikeClient();
    }
    return this.aerospikeClient;
}

public void close() {
    if (aerospikeClient != null && aerospikeClient.isConnected()) {
        aerospikeClient.close();
    }
}

public String getSomeValueFromCode(String code) {
    connect();
    // namespace and setName are fields on this DAO
    Key key = new Key(namespace, setName, code);
    Record record = aerospikeClient.get(null, key, "SomeValue"); // null = default read policy
    close();
    return record == null ? null : record.getString("SomeValue");
}
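For reference, the `connect()`/`close()` pair in the DAO runs once per lookup, so every row can open a fresh connection. An alternative I am weighing is one shared, lazily created client per executor JVM, closed via a shutdown hook (the real `AerospikeClient` is documented as thread-safe, so a single instance can serve all tasks). A sketch, using `AutoCloseable` as a stand-in for the Aerospike client:

```java
import java.util.function.Supplier;

// Sketch: one client per JVM instead of connect()/close() per record.
// AutoCloseable stands in for com.aerospike.client.AerospikeClient here.
public final class SharedClient {
    private static volatile AutoCloseable instance;

    private SharedClient() {}

    public static AutoCloseable get(Supplier<AutoCloseable> factory) {
        if (instance == null) {                   // fast path, no lock
            synchronized (SharedClient.class) {
                if (instance == null) {           // double-checked locking
                    AutoCloseable created = factory.get();
                    // close exactly once, when the executor JVM exits
                    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                        try { created.close(); } catch (Exception ignored) { }
                    }));
                    instance = created;
                }
            }
        }
        return instance;
    }
}
```

Each executor would then hold one open connection pool for its whole lifetime instead of churning connections per row, which is what I suspect is behind the "too many db connection open" error.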