I need to bring some data from MongoDB into a Spark job. I am using the Spark Mongo connector from mongo-spark-connector_2.11. I wrote the code below and ran it in spark-shell to test:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.types.ObjectId

def createReadConfig(topic: String): ReadConfig = {
  // Placeholder connection values
  val user = UserId
  val pass = Password
  val host = Host
  val db = Database
  val coll = Collection
  val partitioner = "MongoPaginateBySizePartitioner"
  ReadConfig(Map(
    "uri" -> ("mongodb://" + user + ":" + pass + "@" + host + "/" + db),
    "database" -> db,
    "collection" -> coll,
    "partitioner" -> partitioner))
}

// admissionConfig is a ReadConfig built with createReadConfig above
val collectionRDD = MongoSpark.load(sc, admissionConfig)
collectionRDD.filter(doc => doc.getObjectId("_id") == new ObjectId("objectId")).count
Getting the result takes more than 20 seconds, whereas the same query in the mongo console takes less than a second.
Why does this happen, and how can the speed difference be reduced?
Answer 0 (score: 2)
Why does this happen, and how can the speed difference be reduced?

The difference is that executing RDD.filter() first loads the data from MongoDB into the Spark workers and then performs the filter operation there. Depending on your network, the data size, the MongoDB server, and the Spark workers, this can take much longer than executing the equivalent query through the mongo shell.

You can instead take advantage of the withPipeline feature of the MongoDB Connector for Spark, for example:
import org.bson.Document

val rdd = MongoSpark.load(sc)
val aggregatedRDD = rdd.withPipeline(Seq(Document.parse("{ $match: { '_id' : 'some id' } }")))
The above filters the data and runs the aggregation inside MongoDB before any documents are passed to Spark. This reduces the amount of data transferred from the MongoDB server to the Spark workers and also allows database indexes to be used.
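Applied to the question's setup, a minimal sketch might look like the following, reusing the createReadConfig helper from the question; the topic name and the ObjectId value are placeholders. Building the $match stage with org.bson.Document and org.bson.types.ObjectId keeps _id typed as an ObjectId, so MongoDB can match against the _id index:

import com.mongodb.spark.MongoSpark
import org.bson.Document
import org.bson.types.ObjectId

// Placeholder topic name and ObjectId; createReadConfig is the helper from the question.
val readConfig = createReadConfig("someTopic")
val rdd = MongoSpark.load(sc, readConfig)

// The $match stage is pushed down to MongoDB, so only matching documents reach the Spark workers.
val matchStage = new Document("$match", new Document("_id", new ObjectId("5b84d2bf02a3aa1e72b277d7")))
val matchedCount = rdd.withPipeline(Seq(matchStage)).count()

With an _id match evaluated on the server, MongoDB can answer from the _id index rather than shipping the whole collection to Spark, which is what makes the mongo-shell query fast in the first place.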
Answer 1 (score: 0)
You can enable MongoDB query profiling to check the difference:
db.setProfilingLevel(2)
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()
When filtering the Spark RDD, it looks like the entire collection is pulled from the database (the pipeline in the profiled command is empty):
{
"op" : "command",
"ns" : "test.scenter_inventory_center_sc_stock_sku",
"command" : {
"aggregate" : "scenter_inventory_center_sc_stock_sku",
"pipeline" : [ ],
"cursor" : {
},
"$db" : "test",
"$readPreference" : {
"mode" : "primaryPreferred"
}
},
"cursorid" : NumberLong("8629727736555097197"),
"keysExamined" : 0,
"docsExamined" : 311,
"numYield" : 2,
"locks" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(8)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(4)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(4)
}
}
},
"nreturned" : 101,
"responseLength" : 26058,
"protocol" : "op_msg",
"millis" : 1,
"planSummary" : "COLLSCAN",
"ts" : ISODate("2018-08-28T06:23:45.089Z"),
"client" : "172.17.0.1",
"allUsers" : [ ],
"user" : ""
}
When using the Mongo RDD with a pipeline, the match condition is present in the query itself (pipeline):
{
"op" : "command",
"ns" : "test.scenter_inventory_center_sc_stock_sku",
"command" : {
"aggregate" : "scenter_inventory_center_sc_stock_sku",
"pipeline" : [
{
"$match" : {
"warehouse_code" : {
"$eq" : "1"
}
}
}
],
"cursor" : {
},
"$db" : "test",
"$readPreference" : {
"mode" : "primaryPreferred"
}
},
"keysExamined" : 0,
"docsExamined" : 311,
"cursorExhausted" : true,
"numYield" : 2,
"locks" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(8)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(4)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(4)
}
}
},
"nreturned" : 74,
"responseLength" : 19248,
"protocol" : "op_msg",
"millis" : 1,
"planSummary" : "COLLSCAN",
"ts" : ISODate("2018-08-28T06:23:53.735Z"),
"client" : "172.17.0.1",
"allUsers" : [ ],
"user" : ""
}
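For reference, a pushed-down pipeline like the one in the second profile entry above could come from a call along these lines. This is only a sketch: the URI and database name are assumptions, while the collection name and the warehouse_code value are taken from the profile output:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.Document

// URI and database are assumptions; the collection matches the profiled namespace.
val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://localhost:27017/test",
  "database" -> "test",
  "collection" -> "scenter_inventory_center_sc_stock_sku"))

// The $match stage is executed by MongoDB, which is why it appears in the profiled pipeline.
val matched = MongoSpark.load(sc, readConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "warehouse_code": "1" } }""")))
matched.count()

Note that both profile entries show planSummary: COLLSCAN and keysExamined: 0, so even with the server-side filter an index on the matched field is still needed to avoid scanning the whole collection.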