I am trying to get the inserted _ids back when doing a bulk upsert, and I have made some modifications to the MongoSpark.save method (the original is here), but I am getting a java.lang.IllegalStateException: state should be: open exception, which seems to indicate that a closed connection is being reused.

Here is my modified version of MongoSpark.save, changed to return the BulkWriteResults:
import com.mongodb.bulk.BulkWriteResult
import com.mongodb.client.MongoCollection
import com.mongodb.client.model.{InsertOneModel, UpdateOneModel, UpdateOptions}
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.WriteConfig
import com.mongodb.spark.sql.MapFunctions.rowToDocumentMapper
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.bson.BsonDocument

import scala.collection.JavaConverters._

object MongoSparkExt {

  // Adapted from MongoSpark.save; returns the BulkWriteResults instead of Unit.
  def save[D](dataset: Dataset[D], writeConfig: WriteConfig): Array[BulkWriteResult] = {
    val mongoConnector = MongoConnector(writeConfig.asOptions)
    val dataSet = dataset.toDF()
    val mapper = rowToDocumentMapper(dataSet.schema)
    val documentRdd: RDD[BsonDocument] = dataSet.rdd.map(row => mapper(row))
    val fieldNames = dataset.schema.fieldNames.toList // carried over from the original save; unused here
    val queryKeyList = BsonDocument.parse(writeConfig.shardKey.getOrElse("{_id: 1}")).keySet().asScala.toList

    val resultRDD: RDD[BulkWriteResult] = documentRdd.mapPartitions(iter =>
      if (iter.nonEmpty) {
        mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[BsonDocument] =>
          iter.grouped(writeConfig.maxBatchSize).map(batch => {
            val requests = batch.map(doc =>
              if (queryKeyList.forall(doc.containsKey(_))) {
                // disabled ReplaceOneModel because it doesn't compile
                val queryDocument = new BsonDocument()
                queryKeyList.foreach(key => queryDocument.append(key, doc.get(key)))
                queryDocument.keySet().asScala.foreach(doc.remove(_))
                new UpdateOneModel[BsonDocument](queryDocument, new BsonDocument("$set", doc), new UpdateOptions().upsert(true))
              } else {
                new InsertOneModel[BsonDocument](doc)
              })
            collection.bulkWrite(requests.toList.asJava)
          }).toIterator
        })
      } else Seq.empty[BulkWriteResult].toIterator)

    resultRDD.collect // return an Array instead of an RDD
  }
}
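The point of collecting the BulkWriteResults is to read the upserted _ids back out of them afterwards, roughly like this (a sketch; BulkWriteResult.getUpserts() in the Java driver reports the documents that went through the upsert path, together with the _id assigned to each):

import com.mongodb.bulk.BulkWriteResult
import org.bson.BsonValue

import scala.collection.JavaConverters._

// Collect the _ids of documents that were newly upserted (not merely updated)
// from the results returned by MongoSparkExt.save.
def upsertedIds(results: Array[BulkWriteResult]): Seq[BsonValue] =
  results.toSeq.flatMap(_.getUpserts.asScala.map(_.getId))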
The intuition is that the upsert should give me back the complete list of _ids of the documents in my dataset that were newly inserted, as opposed to merely updated. Since that list of _ids could be large, I haven't decided yet whether to return an RDD or an Array of BulkWriteResult; I have tried both, and the exception persists either way:
java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.internal.connection.BaseCluster.getDescription(BaseCluster.java:164)
at com.mongodb.internal.connection.SingleServerCluster.getDescription(SingleServerCluster.java:41)
at com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:136)
at com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:94)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:249)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:190)
at com.mongodb.client.internal.MongoCollectionImpl.executeBulkWrite(MongoCollectionImpl.java:467)
at com.mongodb.client.internal.MongoCollectionImpl.bulkWrite(MongoCollectionImpl.java:447)
... will paste more if necessary
This works when run on a very small dataset (10 rows, which I repartition into 3 partitions), but on a larger dataset it fails: only the first batch is written to MongoDB, and the exception above is thrown.
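For completeness, this is roughly how I invoke it (a sketch; the WriteConfig options, URI, and the smallDataset name are placeholders for my actual setup, and the exact option keys may vary with the connector version):

import com.mongodb.spark.config.WriteConfig

// Placeholder connection settings.
val writeConfig = WriteConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",
  "collection" -> "docs"
))

// The tiny 10-row test is repartitioned to 3 before saving.
val results = MongoSparkExt.save(smallDataset.repartition(3), writeConfig)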
This question was originally posted on jira.mongodb.org, and on the suggestion of the mongo-spark-connector's author I decided to move it here.