I need to bring some data from MongoDB into a Spark job. I am using the Spark Mongo connector from mongo-spark-connector_2.11. I wrote the code below and ran it in spark-shell to test:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.types.ObjectId

def createReadConfig(topic: String): ReadConfig = {
  // Placeholder connection values
  val user = UserId
  val pass = Password
  val host = Host
  val db = Database
  val coll = Collection
  val partitioner = "MongoPaginateBySizePartitioner"
  ReadConfig(Map(
    "uri" -> ("mongodb://" + user + ":" + pass + "@" + host + "/" + db),
    "database" -> db,
    "collection" -> coll,
    "partitioner" -> partitioner))
}

// admissionConfig is a ReadConfig built with createReadConfig above
val collectionRDD = MongoSpark.load(sc, admissionConfig)
collectionRDD.filter(doc => doc.getObjectId("_id") == new ObjectId("objectId")).count
Getting the result takes more than 20 seconds, whereas the same query in the mongo console takes less than a second.
Why does this happen, and how can the speed difference be reduced?
Answer 0 (score: 2)
Why does this happen, and how can the speed difference be reduced?

The difference is that executing RDD.filter() first loads the data from MongoDB into the Spark workers and then performs the filter operation there. Depending on your network, the data size, the MongoDB server, and the Spark workers, this can take much longer than executing the equivalent query through the mongo shell.

You can instead take advantage of the withPipeline feature of the MongoDB Connector for Spark, for example:
import org.bson.Document

val rdd = MongoSpark.load(sc)
val aggregatedRDD = rdd.withPipeline(Seq(Document.parse("{ $match: { '_id' : 'some id' } }")))
The above filters the data and runs the aggregation inside MongoDB before any documents are passed to Spark. This reduces the amount of data transferred from the MongoDB server to the Spark workers and also allows database indexes to be used.
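Applied to the question's setup, a minimal sketch might look like the following, reusing the createReadConfig helper from the question; the topic name and the ObjectId value are placeholders. Building the $match stage with org.bson.Document and org.bson.types.ObjectId keeps _id typed as an ObjectId, so MongoDB can match against the _id index:

import com.mongodb.spark.MongoSpark
import org.bson.Document
import org.bson.types.ObjectId

// Placeholder topic name and ObjectId; createReadConfig is the helper from the question.
val readConfig = createReadConfig("someTopic")
val rdd = MongoSpark.load(sc, readConfig)

// The $match stage is pushed down to MongoDB, so only matching documents reach the Spark workers.
val matchStage = new Document("$match", new Document("_id", new ObjectId("5b84d2bf02a3aa1e72b277d7")))
val matchedCount = rdd.withPipeline(Seq(matchStage)).count()

With an _id match evaluated on the server, MongoDB can answer from the _id index rather than shipping the whole collection to Spark, which is what makes the mongo-shell query fast in the first place.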
Answer 1 (score: 0)
You can enable MongoDB query profiling to check the difference:
db.setProfilingLevel(2)
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()
When filtering the Spark RDD, it looks like the entire collection is pulled from the database (the pipeline in the profiled command is empty):
{
"op" : "command",
"ns" : "test.scenter_inventory_center_sc_stock_sku",
"command" : {
"aggregate" : "scenter_inventory_center_sc_stock_sku",
"pipeline" : [ ],
"cursor" : {
},
"$db" : "test",
"$readPreference" : {
"mode" : "primaryPreferred"
}
},
"cursorid" : NumberLong("8629727736555097197"),
"keysExamined" : 0,
"docsExamined" : 311,
"numYield" : 2,
"locks" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(8)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(4)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(4)
}
}
},
"nreturned" : 101,
"responseLength" : 26058,
"protocol" : "op_msg",
"millis" : 1,
"planSummary" : "COLLSCAN",
"ts" : ISODate("2018-08-28T06:23:45.089Z"),
"client" : "172.17.0.1",
"allUsers" : [ ],
"user" : ""
}
When using the Mongo RDD with a pipeline, the match condition is present in the query itself (pipeline):
{
"op" : "command",
"ns" : "test.scenter_inventory_center_sc_stock_sku",
"command" : {
"aggregate" : "scenter_inventory_center_sc_stock_sku",
"pipeline" : [
{
"$match" : {
"warehouse_code" : {
"$eq" : "1"
}
}
}
],
"cursor" : {
},
"$db" : "test",
"$readPreference" : {
"mode" : "primaryPreferred"
}
},
"keysExamined" : 0,
"docsExamined" : 311,
"cursorExhausted" : true,
"numYield" : 2,
"locks" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(8)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(4)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(4)
}
}
},
"nreturned" : 74,
"responseLength" : 19248,
"protocol" : "op_msg",
"millis" : 1,
"planSummary" : "COLLSCAN",
"ts" : ISODate("2018-08-28T06:23:53.735Z"),
"client" : "172.17.0.1",
"allUsers" : [ ],
"user" : ""
}
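For reference, a pushed-down pipeline like the one in the second profile entry above could come from a call along these lines. This is only a sketch: the URI and database name are assumptions, while the collection name and the warehouse_code value are taken from the profile output:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.Document

// URI and database are assumptions; the collection matches the profiled namespace.
val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://localhost:27017/test",
  "database" -> "test",
  "collection" -> "scenter_inventory_center_sc_stock_sku"))

// The $match stage is executed by MongoDB, which is why it appears in the profiled pipeline.
val matched = MongoSpark.load(sc, readConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "warehouse_code": "1" } }""")))
matched.count()

Note that both profile entries show planSummary: COLLSCAN and keysExamined: 0, so even with the server-side filter an index on the matched field is still needed to avoid scanning the whole collection.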