mongo-spark-connector: unexpected connection close when trying to fetch the BulkWriteResult after an upsert

Date: 2018-09-27 15:40:46

Tags: mongodb apache-spark

I am trying to get hold of the inserted _ids during a bulk upsert, so I made some modifications to the MongoSpark.save method (the original is here), but I am getting a java.lang.IllegalStateException: state should be: open, which presumably indicates that a closed connection is being reused.

Here is my modified version of MongoSpark.save that returns the BulkWriteResults:

import scala.collection.JavaConverters._

import com.mongodb.bulk.BulkWriteResult
import com.mongodb.client.MongoCollection
import com.mongodb.client.model.{InsertOneModel, UpdateOneModel, UpdateOptions}
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.bson.BsonDocument

object MongoSparkExt {

  def save[D](dataset: Dataset[D], writeConfig: WriteConfig): Array[BulkWriteResult] = {
    val mongoConnector = MongoConnector(writeConfig.asOptions)
    val dataSet = dataset.toDF()
    // rowToDocumentMapper is the Row => BsonDocument mapper that MongoSpark.save uses internally
    val mapper = rowToDocumentMapper(dataSet.schema)
    val documentRdd: RDD[BsonDocument] = dataSet.rdd.map(row => mapper(row))
    val fieldNames = dataset.schema.fieldNames.toList
    val queryKeyList = BsonDocument.parse(writeConfig.shardKey.getOrElse("{_id: 1}")).keySet().asScala.toList

    val resultRDD: RDD[BulkWriteResult] = documentRdd.mapPartitions(iter => if (iter.nonEmpty) {
      mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[BsonDocument] =>
        iter.grouped(writeConfig.maxBatchSize).map(batch => {
          val requests = batch.map(doc =>
            if (queryKeyList.forall(doc.containsKey(_))) {
              // disabled ReplaceOneModel because it doesn't compile
              val queryDocument = new BsonDocument()
              queryKeyList.foreach(key => queryDocument.append(key, doc.get(key)))
              queryDocument.keySet().asScala.foreach(doc.remove(_))
              new UpdateOneModel[BsonDocument](queryDocument, new BsonDocument("$set", doc), new UpdateOptions().upsert(true))
            } else {
              new InsertOneModel[BsonDocument](doc)
            })
          collection.bulkWrite(requests.toList.asJava)
        }).toIterator
      })
    } else Seq.empty[BulkWriteResult].toIterator)
    resultRDD.collect // return an Array instead of an RDD
  }

}
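For context, this is roughly how the extension is meant to be called; the connection options and someDataset below are illustrative assumptions, not part of the original post:

import com.mongodb.bulk.BulkWriteResult
import com.mongodb.spark.config.WriteConfig

// Hypothetical write configuration; adjust uri/database/collection to your deployment.
val writeConfig = WriteConfig(Map(
  "uri"          -> "mongodb://localhost:27017",
  "database"     -> "test",
  "collection"   -> "events",
  "maxBatchSize" -> "512"
))

// someDataset stands in for any Dataset[D] to be upserted.
val results: Array[BulkWriteResult] = MongoSparkExt.save(someDataset, writeConfig)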

The intuition behind the upsert path is that this save should hand back the full list of _ids from my dataset that were newly inserted, as opposed to merely updated (a sketch of that extraction follows the stack trace below). Since that list of _ids could be large, I have not yet decided whether to return an RDD or an Array of BulkWriteResult; I have tried both, and the exception persists either way:

java.lang.IllegalStateException: state should be: open                                                                             
        at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)                                                            
        at com.mongodb.internal.connection.BaseCluster.getDescription(BaseCluster.java:164)                                        
        at com.mongodb.internal.connection.SingleServerCluster.getDescription(SingleServerCluster.java:41)  
        at com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:136)            
        at com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:94)                        
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:249)
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:190)
        at com.mongodb.client.internal.MongoCollectionImpl.executeBulkWrite(MongoCollectionImpl.java:467)        
        at com.mongodb.client.internal.MongoCollectionImpl.bulkWrite(MongoCollectionImpl.java:447)      
... will paste more if necessary                           
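For illustration, extracting those newly inserted _ids once an Array[BulkWriteResult] is in hand might look like the minimal sketch below; getUpserts and getId are the MongoDB Java driver's BulkWriteResult/BulkWriteUpsert accessors, while the upsertedIds name is just a placeholder:

import scala.collection.JavaConverters._

import com.mongodb.bulk.BulkWriteResult
import org.bson.BsonValue

// getUpserts lists only the documents that an upsert actually inserted;
// documents that matched an existing key and were updated do not appear here.
def upsertedIds(results: Array[BulkWriteResult]): Seq[BsonValue] =
  results.toSeq.flatMap(_.getUpserts.asScala.map(_.getId))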

This works when run on a very small dataset (10 rows, repartitioned to 3 partitions), but on larger datasets it fails: only the first batch of documents is upserted into MongoDB before the exception above is thrown.
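Not from the original post, but one hypothesis worth illustrating: the iterator returned from withCollectionDo above is lazy, so the bulkWrite calls may only run once the connection has already been released. A sketch (reusing mongoConnector, documentRdd and writeConfig from the snippet above, with a hypothetical toWriteModel standing in for the UpdateOneModel/InsertOneModel branch) that forces every batch to be written while the collection handle is still open:

val eagerResults: RDD[BulkWriteResult] = documentRdd.mapPartitions(iter =>
  if (iter.nonEmpty) {
    mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[BsonDocument] =>
      iter.grouped(writeConfig.maxBatchSize).map { batch =>
        // toWriteModel: BsonDocument => WriteModel[BsonDocument] (placeholder for the logic above)
        val requests = batch.map(toWriteModel)
        collection.bulkWrite(requests.toList.asJava)
      }.toList // materialise every bulkWrite result while the collection is still usable
    }).toIterator
  } else Iterator.empty
)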


This question was originally posted on jira.mongodb.org; following a suggestion from the author of mongo-spark-connector, I decided to move it here.

0 Answers