Updating/Replacing Mongo Documents Using Apache Spark

Date: 2017-11-04 18:26:56

Tags: mongodb apache-spark rdd connector

This is a common problem when using Spark with MongoDB through the MongoSpark connector. The connector is designed to insert/update documents into MongoDB in batches. There are three ways to insert/update documents using Spark:

  1. RDD[Document]
  2. DataFrame[CaseClass]
  3. Dataset[CaseClass]

Both Dataset and DataFrame support inserting/updating documents via the MongoSpark.save() method, whereas RDD[Document] only supports insert. So we run into a problem when trying to update an RDD[Document] with Mongo Spark.
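For the DataFrame/Dataset path mentioned above, the upsert behavior of MongoSpark.save() looks roughly like this (a minimal sketch, not from the original post; the Person case class and the info.people URI are assumptions matching the sample data below):

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Hypothetical case class matching the sample collection's shape.
case class Person(_id: Int, name: String, age: Int, place: String)

val spark = SparkSession.builder()
  .appName("Upsert Example")
  .master("local[*]")
  // Assumed URI; adjust to your own MongoDB instance.
  .config("spark.mongodb.output.uri", "mongodb://localhost:27017/info.people")
  .getOrCreate()
import spark.implicits._

val df = Seq(Person(100, "Naga", 31, "Bangalore")).toDF()
// For DataFrames/Datasets, save() upserts: rows whose _id already
// exists in the collection are updated, the rest are inserted.
MongoSpark.save(df)
```

No equivalent exists for RDD[Document], which is exactly the gap the question is about.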

    Is there any solution to update/replace an RDD[Document] into MongoDB using Spark?

1 Answer:

Answer 0 (score: 1)

Currently the Mongo Spark Connector does not support updating/replacing an RDD[Document]. However, with the connector's help, there is a workaround for updating/replacing an RDD[Document] of Mongo documents using Apache Spark.

Below is sample code for the update/replace, along with the sample data it operates on:

    db.people.find()

    { "_id" : 100, "name" : "Naga", "age" : 30, "place" : "Bangalore" }
    { "_id" : 101, "name" : "Ravi", "age" : 33, "place" : "Bangalore" }
    { "_id" : 102, "name" : "Hari", "age" : 23, "place" : "Mysore" }


    import java.util.HashMap

    import com.mongodb.client.MongoCollection
    import com.mongodb.spark.config.{ReadConfig, WriteConfig}
    import com.mongodb.spark.{MongoConnector, MongoSpark}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.Document

    val conf = new SparkConf().setAppName("Spark Mongo").setMaster("local[*]")
    val readOverrides = new HashMap[String, String]()
    readOverrides.put("spark.mongodb.input.uri", "mongodb://localhost:27017/info.people")
    val readConfig = ReadConfig.create(conf, readOverrides)
    val sc = new SparkContext(conf)

    // Load the collection as an RDD[Document] and add a new field to every document.
    val peopleRDD = MongoSpark.load(sc, readConfig)
    val updateRDD = peopleRDD.map { document => document.append("state", "karnataka") }

    val writeOverrides = new HashMap[String, String]()
    writeOverrides.put("spark.mongodb.output.uri", "mongodb://localhost:27017/info.people")
    writeOverrides.put("replaceDocument", "false")
    val writeConfig = WriteConfig.create(conf, writeOverrides)
    save(updateRDD, writeConfig)

    // Replaces each document in the collection with the matching document
    // from the RDD, using _id as the match key.
    def save(rdd: RDD[Document], writeConfig: WriteConfig): Unit = {
      val mongoConnector = MongoConnector(writeConfig.asOptions)
      rdd.foreachPartition { partition =>
        if (partition.nonEmpty) {
          mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
            partition.foreach { document =>
              // The documents were inserted from the mongo shell,
              // so the numeric _id values are stored as doubles.
              val searchDocument = new Document()
              searchDocument.append("_id", document.get("_id").asInstanceOf[Double])
              collection.replaceOne(searchDocument, document)
            }
          })
        }
      }
    }
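If some `_id` values in the RDD might not exist in the collection yet, the replaceOne call above can be turned into an upsert. A hedged variant of the inner loop (an assumption on my part, not part of the original answer; it relies on the Java driver's UpdateOptions, which ships on the classpath with the connector):

```scala
import com.mongodb.client.model.UpdateOptions

// Same loop as in save(), but with upsert enabled: documents whose
// _id is not found in the collection are inserted instead of skipped.
partition.foreach { document =>
  val searchDocument = new Document("_id", document.get("_id"))
  collection.replaceOne(searchDocument, document, new UpdateOptions().upsert(true))
}
```

For large partitions, batching these into a single bulkWrite per partition would reduce round trips, at the cost of slightly more code.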

    { "_id" : 100, "name" : "Naga", "age" : 30, "place" : "Bangalore", "state" : "karnataka" }
    { "_id" : 101, "name" : "Ravi", "age" : 33, "place" : "Bangalore", "state" : "karnataka" }
    { "_id" : 102, "name" : "Hari", "age" : 23, "place" : "Mysore", "state" : "karnataka" }

This solution works.