I have a Spark Streaming application that reads log records from Kafka and inserts all access-log records into MongoDB. The application runs fine for the first few batches, but after a while one job seems to take a very long time to insert its data into MongoDB. I suspect something is wrong with my MongoDB connection-pool configuration, but I have tried many variations with no improvement.
Here are the results from the web UI:

Spark: version 1.5.1 on YARN (yes, this is probably quite old).
MongoDB: version 3.4.4, with 12 shards running on four machines, each with 160 GB+ of RAM and 40 CPU cores. The connection-pool code:
```java
private MongoManager() {
    if (mongoClient == null) {
        MongoClientOptions.Builder build = new MongoClientOptions.Builder();
        build.connectionsPerHost(200);
        build.socketTimeout(1000);                                  // 1 s socket timeout
        build.threadsAllowedToBlockForConnectionMultiplier(200);
        build.maxWaitTime(1000 * 60 * 2);                           // wait up to 2 min for a pooled connection
        build.connectTimeout(1000 * 60 * 1);                        // 1 min connect timeout
        build.writeConcern(WriteConcern.UNACKNOWLEDGED);            // fire-and-forget writes
        MongoClientOptions myOptions = build.build();
        try {
            // Four mongos routers, all listening on port 20000
            ServerAddress serverAddress1 = new ServerAddress(ip1, 20000);
            ServerAddress serverAddress2 = new ServerAddress(ip2, 20000);
            ServerAddress serverAddress3 = new ServerAddress(ip3, 20000);
            ServerAddress serverAddress4 = new ServerAddress(ip4, 20000);
            List<ServerAddress> lists = new ArrayList<>(8);
            lists.add(serverAddress1);
            lists.add(serverAddress2);
            lists.add(serverAddress3);
            lists.add(serverAddress4);
            mongoClient = new MongoClient(lists, myOptions);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
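The `getInstance()` called from the Spark code below is not shown here. Since each Spark executor is its own JVM, each executor should end up sharing exactly one `MongoClient` (and thus one pool). A minimal sketch of thread-safe lazy initialization using the holder idiom (the constructor body is only a placeholder, not the original code):

```java
public class MongoManager {
    // Initialization-on-demand holder: the JVM class loader guarantees that
    // Holder.INSTANCE is created lazily and exactly once, even when many
    // partition-processing threads call getInstance() concurrently.
    private static class Holder {
        static final MongoManager INSTANCE = new MongoManager();
    }

    private MongoManager() {
        // Placeholder: in the real class this is where MongoClientOptions
        // are built and the shared MongoClient is created, as shown above.
    }

    public static MongoManager getInstance() {
        return Holder.INSTANCE;
    }
}
```

With this wiring, every `foreachPartition` task on the same executor reuses the same pool instead of opening new connections per batch.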
```java
public void inSertBatch(String dbName, String collectionName, List<DBObject> jsons) {
    if (jsons == null || jsons.isEmpty()) {
        return;
    }
    DB db = mongoClient.getDB(dbName);
    DBCollection dbCollection = db.getCollection(collectionName);
    dbCollection.insert(jsons);    // one bulk insert for the whole partition
}
```
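A side note on one thing I could still try (an assumption on my part, not something already in the code): `inSertBatch` sends an entire partition as a single insert, so one oversized partition can make a task look stalled. A hypothetical helper that splits a batch into bounded chunks before inserting:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Split a list into consecutive sublists of at most chunkSize elements,
    // so each MongoDB insert call carries a bounded number of documents.
    public static <T> List<List<T>> chunk(List<T> docs, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunkSize) {
            // Copy the view so each chunk is independent of the source list
            chunks.add(new ArrayList<>(docs.subList(i, Math.min(i + chunkSize, docs.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);
        List<List<Integer>> chunks = chunk(docs, 1000);
        System.out.println(chunks.size());        // 3
        System.out.println(chunks.get(2).size()); // 500
    }
}
```

Each chunk could then be passed to `dbCollection.insert(...)` in turn, keeping per-call payloads bounded.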
The Spark Streaming code is as follows:
```scala
referDstream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Build one DBObject per record, then insert the whole partition in one call
    val records = partition.map(x => {
      val data = x._1.split("_")
      val dbObject: DBObject = new BasicDBObject()
      dbObject.put("xxx", "xxx")
      ...
      dbObject
    }).toList
    val mg: MongoManager = MongoManager.getInstance()
    mg.inSertBatch("dbname", "colname", records.asJava)
  })
})
```
The script that submits the application:
```shell
nohup ${SPARK_HOME}/bin/spark-submit --name ${jobname} --driver-cores 2 --driver-memory 8g \
  --num-executors 20 --executor-memory 16g --executor-cores 4 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" --conf "spark.shuffle.manager=hash" \
  --conf "spark.shuffle.consolidateFiles=true" --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --master ${master} --class ${APP_MAIN} --jars ${jars_path:1} ${APP_HOME}/${MAINJAR} ${sparkconf} &
```
Data obtained from mongostat and the mongos shell:
```
$ mongostat -h xxx.xxx.xxx.xxx:20000
insert query update delete getmore command flushes mapped vsize res faults qrw arw net_in net_out conn time
    *0    *0     *0     *0       0    14|0       0     0B 1.17G 514M      0 0|0 0|0   985b   19.4k   58 Dec  7 03:10:52.949
  2999    *0     *0     *0       0     8|0       0     0B 1.17G 514M      0 0|0 0|0   517b   17.6k   58 Dec  7 03:10:53.950
 15000    *0     *0     *0       0    19|0       0     0B 1.17G 514M      0 0|0 0|0   402b   17.2k   58 Dec  7 03:10:54.950
 17799    *0     *0     *0       0    22|0       0     0B 1.17G 514M      0 0|0 0|0  30.5m   16.9k   58 Dec  7 03:10:55.950
 15996    *0     *0     *0       0    18|0       0     0B 1.17G 514M      0 0|0 0|0   343b   16.9k   58 Dec  7 03:10:56.950
 12003    *0     *0     *0       0    26|0       0     0B 1.17G 514M      0 0|0 0|0   982b   19.3k   58 Dec  7 03:10:57.949
    *0    *0     *0     *0       0     6|0       0     0B 1.17G 514M      0 0|0 0|0   518b   17.6k   58 Dec  7 03:10:58.949
  4704    *0     *0     *0       0     8|0       0     0B 1.17G 514M      0 0|0 0|0  10.2m   17.1k   58 Dec  7 03:10:59.950
 34600    *0     *0     *0       0    64|0       0     0B 1.17G 526M      0 0|0 0|0  26.9m   16.9k   58 Dec  7 03:11:00.951
 33129    *0     *0     *0       0    36|0       0     0B 1.17G 526M      0 0|0 0|0   344b   17.0k   58 Dec  7 03:11:01.949
```

```
mongos> db.serverStatus().connections
{ "current" : 57, "available" : 19943, "totalCreated" : 2707 }
```
Thanks for any advice on how to troubleshoot this.