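Spark mongo connection takes much longer than expected

Time: 2017-08-24 22:43:01

Tags: mongodb scala apache-spark

I need to bring some data from MongoDB into a Spark job. I am using the Spark Mongo connector from mongo-spark-connector_2.11. I wrote the code below and ran it in spark-shell to test it: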

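import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.types.ObjectId

// UserId, Password, Host, Database and Collection are placeholders for the real values.
def createReadConfig(topic: String): ReadConfig = {
    val user = UserId
    val pass = Password
    val host = Host
    val db = Database
    val coll = Collection
    // ReadConfig takes a Map[String, String], so the partitioner is given by name.
    val partitioner = "MongoPaginateBySizePartitioner"
    ReadConfig(Map(
        "uri" -> ("mongodb://" + user + ":" + pass + "@" + host + "/" + db),
        "database" -> db,
        "collection" -> coll,
        "partitioner" -> partitioner))
}

// admissionConfig is assumed to be a ReadConfig built with createReadConfig.
val collectionRDD = MongoSpark.load(sc, admissionConfig)

// "objectId" stands in for a real 24-character hex ObjectId string.
collectionRDD.filter(doc => doc.getObjectId("_id") == new ObjectId("objectId")).count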

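Getting the result takes more than 20 seconds, while the same query in the mongo console takes less than a second.

Why does this happen, and how can I reduce the difference in speed?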

2 answers:

Answer 0 (score: 2)

  

Why does this happen, and how can I reduce the difference in speed?

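The difference is that RDD.filter() first loads the data from MongoDB into the Spark workers and only then applies the filter. Compared with running the matching query in the mongo shell, this can take considerably longer, depending on your network, the size of the data, the MongoDB server and the Spark workers.

You can use the withPipeline function of the MongoDB Connector for Spark to push the filter down to MongoDB, for example: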

val rdd = MongoSpark.load(sc)

val aggregatedRDD = rdd.withPipeline(Seq(Document.parse("{ $match: { '_id' : 'some id' } }")))

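The code above runs the aggregation in MongoDB and filters the data before any documents are passed to Spark. This reduces the amount of data transferred from the MongoDB server to the Spark workers, and it also lets the query take advantage of database indexes.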

See also MongoDB Spark Connector: Filters and Aggregation
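
Since the question filters on an ObjectId rather than a string, here is a minimal sketch of the same withPipeline approach for that case. It assumes spark-shell was started with the connector on the classpath and spark.mongodb.input.uri configured; idHex is a placeholder value, not an id from the original post.

```scala
import com.mongodb.spark.MongoSpark
import org.bson.Document

// Assumption: `sc` is the spark-shell SparkContext and
// spark.mongodb.input.uri points at the target collection.
val rdd = MongoSpark.load(sc)

// Placeholder id: replace with a real 24-character hex ObjectId.
val idHex = "0123456789abcdef01234567"

// Extended JSON {"$oid": ...} makes Document.parse produce an ObjectId,
// so $match compares against the stored type instead of a plain string.
val pipeline = Seq(Document.parse(s"""{ "$$match": { "_id": { "$$oid": "$idHex" } } }"""))

// The $match stage runs inside MongoDB; only matching documents reach Spark.
val matched = rdd.withPipeline(pipeline)
println(matched.count())
```

Matching on '_id' as a string, as in the snippet above, would return nothing if the stored ids are ObjectIds, which is why the sketch uses the $oid form.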

Answer 1 (score: 0)

You can profile the MongoDB queries to see the difference.

db.setProfilingLevel(2)
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()

With a Spark RDD, it looks like the whole collection is fetched from the database (the pipeline is empty):

{
    "op" : "command",
    "ns" : "test.scenter_inventory_center_sc_stock_sku",
    "command" : {
        "aggregate" : "scenter_inventory_center_sc_stock_sku",
        "pipeline" : [ ],
        "cursor" : {

        },
        "$db" : "test",
        "$readPreference" : {
            "mode" : "primaryPreferred"
        }
    },
    "cursorid" : NumberLong("8629727736555097197"),
    "keysExamined" : 0,
    "docsExamined" : 311,
    "numYield" : 2,
    "locks" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(8)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(4)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "r" : NumberLong(4)
            }
        }
    },
    "nreturned" : 101,
    "responseLength" : 26058,
    "protocol" : "op_msg",
    "millis" : 1,
    "planSummary" : "COLLSCAN",
    "ts" : ISODate("2018-08-28T06:23:45.089Z"),
    "client" : "172.17.0.1",
    "allUsers" : [ ],
    "user" : ""
}

With a Mongo RDD, the filter condition appears in the query (pipeline):

{
    "op" : "command",
    "ns" : "test.scenter_inventory_center_sc_stock_sku",
    "command" : {
        "aggregate" : "scenter_inventory_center_sc_stock_sku",
        "pipeline" : [
            {
                "$match" : {
                    "warehouse_code" : {
                        "$eq" : "1"
                    }
                }
            }
        ],
        "cursor" : {

        },
        "$db" : "test",
        "$readPreference" : {
            "mode" : "primaryPreferred"
        }
    },
    "keysExamined" : 0,
    "docsExamined" : 311,
    "cursorExhausted" : true,
    "numYield" : 2,
    "locks" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(8)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(4)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "r" : NumberLong(4)
            }
        }
    },
    "nreturned" : 74,
    "responseLength" : 19248,
    "protocol" : "op_msg",
    "millis" : 1,
    "planSummary" : "COLLSCAN",
    "ts" : ISODate("2018-08-28T06:23:53.735Z"),
    "client" : "172.17.0.1",
    "allUsers" : [ ],
    "user" : ""
}
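
For reference, a rough sketch of one way to get a pushed-down $match like the one in the second profile entry without writing the pipeline by hand. According to the connector's "Filters and Aggregation" documentation, filters on DataFrames are translated into an aggregation pipeline; the exact stages generated may differ from the output above. The sketch assumes spark.mongodb.input.uri points at the test.scenter_inventory_center_sc_stock_sku collection.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Assumption: the session is configured with spark.mongodb.input.uri
// for the collection shown in the profiler output.
val spark = SparkSession.builder().getOrCreate()

// Loading through the SparkSession returns a DataFrame with an inferred schema.
val df = MongoSpark.load(spark)

// The connector is expected to push this filter down to MongoDB as a $match stage.
val filtered = df.filter(df("warehouse_code") === "1")
println(filtered.count())
```

Running the profiler query from this answer again after the job finishes is a simple way to check whether the filter actually reached MongoDB.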