This is a common issue when using Spark with MongoDB through the MongoSpark connector. The connector is designed to insert/update documents into MongoDB in batch mode, and there are three ways to insert/update documents with Spark.
Both Datasets and DataFrames support inserting/updating documents via MongoSpark.save(), whereas an RDD[Document] only supports insert. So we ran into a problem when trying to update an RDD[Document] with MongoSpark. See the sketch below for the two save paths.
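For reference, a minimal sketch of the two batch save paths mentioned above (the info.people URI, collection contents, and SparkSession setup are illustrative assumptions, not from the original question):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.bson.Document

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Save Paths")
  .config("spark.mongodb.output.uri", "mongodb://localhost:27017/info.people")
  .getOrCreate()
import spark.implicits._

// DataFrame/Dataset path: rows that carry an _id column are upserted
// (existing documents are replaced, missing ones inserted).
val peopleDF = Seq((100, "Naga", 30), (101, "Ravi", 33)).toDF("_id", "name", "age")
MongoSpark.save(peopleDF)

// RDD[Document] path: documents are only inserted; an existing _id is never updated/replaced.
val docsRDD = spark.sparkContext.parallelize(Seq(
  Document.parse("""{ "_id" : 102, "name" : "Hari", "age" : 23 }""")))
MongoSpark.save(docsRDD)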
Is there any way to update/replace an RDD[Document] in MongoDB using Spark?
Answer (score: 1)
Currently the Mongo Spark Connector does not support updating/replacing an RDD[Document]. However, with the help of the connector there is a workaround that uses Apache Spark to update/replace the Mongo documents in an RDD[Document].
Below is sample code that performs the update/replace on some sample data:
db.people.find()
{" _id" :100,"名称" :"娜迦","年龄" :30,"地点" :"班加罗尔" }
{" _id" :101,"名称" :" Ravi","年龄" :33,"地点" :"班加罗尔" }
{" _id" :102,"名称" :" Hari","年龄" :23,"地点" :"迈索尔" }
import java.util.HashMap

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.client.MongoCollection
import com.mongodb.spark.{MongoConnector, MongoSpark}
import com.mongodb.spark.config.{ReadConfig, WriteConfig}

val conf = new SparkConf().setAppName("Spark Mongo").setMaster("local[*]")

// Read the existing documents from the info.people collection
val readOverrides = new HashMap[String, String]()
readOverrides.put("spark.mongodb.input.uri", "mongodb://localhost:27017/info.people")
val readConfig = ReadConfig.create(conf, readOverrides)

val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()

val peopleRDD = MongoSpark.load(sc, readConfig)

// Add the new field to every document
val updateRDD = peopleRDD.map { document =>
  document.append("state", "karnataka")
}

val writeOverrides = new HashMap[String, String]()
writeOverrides.put("spark.mongodb.output.uri", "mongodb://localhost:27017/info.people")
writeOverrides.put("replaceDocument", "false")
val writeConfig = WriteConfig.create(conf, writeOverrides)

// Replace each document in the collection, matching on its _id
def save(rdd: RDD[Document], writeConfig: WriteConfig): Unit = {
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  rdd.foreachPartition { partition =>
    if (partition.nonEmpty) {
      mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
        partition.foreach { document =>
          // _id is stored as a double because the sample data was inserted via the mongo shell
          val searchDocument = new Document()
          searchDocument.append("_id", document.get("_id").asInstanceOf[Double])
          collection.replaceOne(searchDocument, document)
        }
      })
    }
  }
}

save(updateRDD, writeConfig)
{" _id" :100,"名称" :"娜迦","年龄" :30,"地点" :"班加罗尔","州" :"卡纳塔克邦" }
{" _id" :101,"名称" :" Ravi","年龄" :33,"地点" :"班加罗尔","州" :"卡纳塔克邦" }
{" _id" :102,"名称" :" Hari","年龄" :23,"地点" :"迈索尔","州" :"卡纳塔克邦" }
This solution works.
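A small variation on the same workaround (my own addition, not part of the original answer): if some _id values in the RDD may not exist in the collection yet, replaceOne can be asked to upsert them instead of silently skipping them. This assumes a MongoDB Java driver version that provides com.mongodb.client.model.ReplaceOptions (3.7+); older drivers expose the same upsert flag on UpdateOptions.

import com.mongodb.client.MongoCollection
import com.mongodb.client.model.ReplaceOptions
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
import org.bson.Document

// Same structure as save() above, but documents with an unknown _id are inserted.
def saveOrInsert(rdd: RDD[Document], writeConfig: WriteConfig): Unit = {
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  rdd.foreachPartition { partition =>
    if (partition.nonEmpty) {
      mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
        partition.foreach { document =>
          // Reuse the stored _id value as-is, whatever its BSON type
          val filter = new Document("_id", document.get("_id"))
          collection.replaceOne(filter, document, new ReplaceOptions().upsert(true))
        }
      })
    }
  }
}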