I have a Spark Streaming application that reads log records from Kafka and inserts all access-log records into MongoDB. The application runs fine for the first few batches, but after a while one job seems to take a very long time to insert its data into MongoDB. I suspect something is wrong with my MongoDB connection-pool configuration, but I have tried many variations with no improvement.
Here are the results from the web UI:

Spark: version 1.5.1 on YARN (yes, this is probably quite old).
MongoDB: version 3.4.4, with 12 shards running on four machines, each with 160 GB+ of RAM and 40 CPU cores. The connection-pool code:
```java
private MongoManager() {
    if (mongoClient == null) {
        MongoClientOptions.Builder build = new MongoClientOptions.Builder();
        build.connectionsPerHost(200);
        build.socketTimeout(1000);                                  // 1 s socket timeout
        build.threadsAllowedToBlockForConnectionMultiplier(200);
        build.maxWaitTime(1000 * 60 * 2);                           // wait up to 2 min for a pooled connection
        build.connectTimeout(1000 * 60 * 1);                        // 1 min connect timeout
        build.writeConcern(WriteConcern.UNACKNOWLEDGED);            // fire-and-forget writes
        MongoClientOptions myOptions = build.build();
        try {
            // Four mongos routers, all listening on port 20000
            ServerAddress serverAddress1 = new ServerAddress(ip1, 20000);
            ServerAddress serverAddress2 = new ServerAddress(ip2, 20000);
            ServerAddress serverAddress3 = new ServerAddress(ip3, 20000);
            ServerAddress serverAddress4 = new ServerAddress(ip4, 20000);
            List<ServerAddress> lists = new ArrayList<>(8);
            lists.add(serverAddress1);
            lists.add(serverAddress2);
            lists.add(serverAddress3);
            lists.add(serverAddress4);
            mongoClient = new MongoClient(lists, myOptions);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
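The `getInstance()` called from the Spark code below is not shown here. Since each Spark executor is its own JVM, each executor should end up sharing exactly one `MongoClient` (and thus one pool). A minimal sketch of thread-safe lazy initialization using the holder idiom (the constructor body is only a placeholder, not the original code):

```java
public class MongoManager {
    // Initialization-on-demand holder: the JVM class loader guarantees that
    // Holder.INSTANCE is created lazily and exactly once, even when many
    // partition-processing threads call getInstance() concurrently.
    private static class Holder {
        static final MongoManager INSTANCE = new MongoManager();
    }

    private MongoManager() {
        // Placeholder: in the real class this is where MongoClientOptions
        // are built and the shared MongoClient is created, as shown above.
    }

    public static MongoManager getInstance() {
        return Holder.INSTANCE;
    }
}
```

With this wiring, every `foreachPartition` task on the same executor reuses the same pool instead of opening new connections per batch.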
```java
public void inSertBatch(String dbName, String collectionName, List<DBObject> jsons) {
    if (jsons == null || jsons.isEmpty()) {
        return;
    }
    DB db = mongoClient.getDB(dbName);
    DBCollection dbCollection = db.getCollection(collectionName);
    dbCollection.insert(jsons);    // one bulk insert for the whole partition
}
```
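A side note on one thing I could still try (an assumption on my part, not something already in the code): `inSertBatch` sends an entire partition as a single insert, so one oversized partition can make a task look stalled. A hypothetical helper that splits a batch into bounded chunks before inserting:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Split a list into consecutive sublists of at most chunkSize elements,
    // so each MongoDB insert call carries a bounded number of documents.
    public static <T> List<List<T>> chunk(List<T> docs, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunkSize) {
            // Copy the view so each chunk is independent of the source list
            chunks.add(new ArrayList<>(docs.subList(i, Math.min(i + chunkSize, docs.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);
        List<List<Integer>> chunks = chunk(docs, 1000);
        System.out.println(chunks.size());        // 3
        System.out.println(chunks.get(2).size()); // 500
    }
}
```

Each chunk could then be passed to `dbCollection.insert(...)` in turn, keeping per-call payloads bounded.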
The Spark Streaming code is as follows:
```scala
referDstream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Build one DBObject per record, then insert the whole partition in one call
    val records = partition.map(x => {
      val data = x._1.split("_")
      val dbObject: DBObject = new BasicDBObject()
      dbObject.put("xxx", "xxx")
      ...
      dbObject
    }).toList
    val mg: MongoManager = MongoManager.getInstance()
    mg.inSertBatch("dbname", "colname", records.asJava)
  })
})
```
The script that submits the application:
```shell
nohup ${SPARK_HOME}/bin/spark-submit --name ${jobname} --driver-cores 2 --driver-memory 8g \
  --num-executors 20 --executor-memory 16g --executor-cores 4 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" --conf "spark.shuffle.manager=hash" \
  --conf "spark.shuffle.consolidateFiles=true" --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --master ${master} --class ${APP_MAIN} --jars ${jars_path:1} ${APP_HOME}/${MAINJAR} ${sparkconf} &
```
Data obtained from mongostat and the mongos shell:
```
$ mongostat -h xxx.xxx.xxx.xxx:20000
insert query update delete getmore command flushes mapped vsize res faults qrw arw net_in net_out conn time
    *0    *0     *0     *0       0    14|0       0     0B 1.17G 514M      0 0|0 0|0   985b   19.4k   58 Dec  7 03:10:52.949
  2999    *0     *0     *0       0     8|0       0     0B 1.17G 514M      0 0|0 0|0   517b   17.6k   58 Dec  7 03:10:53.950
 15000    *0     *0     *0       0    19|0       0     0B 1.17G 514M      0 0|0 0|0   402b   17.2k   58 Dec  7 03:10:54.950
 17799    *0     *0     *0       0    22|0       0     0B 1.17G 514M      0 0|0 0|0  30.5m   16.9k   58 Dec  7 03:10:55.950
 15996    *0     *0     *0       0    18|0       0     0B 1.17G 514M      0 0|0 0|0   343b   16.9k   58 Dec  7 03:10:56.950
 12003    *0     *0     *0       0    26|0       0     0B 1.17G 514M      0 0|0 0|0   982b   19.3k   58 Dec  7 03:10:57.949
    *0    *0     *0     *0       0     6|0       0     0B 1.17G 514M      0 0|0 0|0   518b   17.6k   58 Dec  7 03:10:58.949
  4704    *0     *0     *0       0     8|0       0     0B 1.17G 514M      0 0|0 0|0  10.2m   17.1k   58 Dec  7 03:10:59.950
 34600    *0     *0     *0       0    64|0       0     0B 1.17G 526M      0 0|0 0|0  26.9m   16.9k   58 Dec  7 03:11:00.951
 33129    *0     *0     *0       0    36|0       0     0B 1.17G 526M      0 0|0 0|0   344b   17.0k   58 Dec  7 03:11:01.949
```

```
mongos> db.serverStatus().connections
{ "current" : 57, "available" : 19943, "totalCreated" : 2707 }
```
Thanks for any advice on how to troubleshoot this.