I am working with a Spark 2 application that writes a DataFrame to a three-node MongoDB replica set through com.mongodb.spark.MongoSpark, using roughly the following command:
// The real command is similar to this one, depending on the options
// set on the DataFrame and the DataFrameWriter for the MongoDB configuration,
// such as the writeConcern
import org.apache.spark.sql.{DataFrameWriter, Row}
import com.mongodb.spark.MongoSpark

var df: DataFrameWriter[Row] = spark.sql(sql).write
  .option("uri", theUri)
  .option("database", theDatabase)
  .option("collection", theCollection)
  .option("replaceDocument", "false")
  .mode("append")
[...]
MongoSpark.save(df)
The thing is, even though I am sure the source data coming from the Hive table has a unique primary key, I get duplicate key errors when the Spark application runs:
2019-01-14 13:01:08 ERROR: Job aborted due to stage failure: Task 51 in stage 19.0 failed 8 times,
most recent failure: Lost task 51.7 in stage 19.0 (TID 762, mymachine, executor 21):
com.mongodb.MongoBulkWriteException: Bulk write operation error on server myserver.
Write errors: [BulkWriteError{index=0, code=11000,
message='E11000 duplicate key error collection:
ddbb.tmp_TABLE_190114125615 index: idx_unique dup key: { : "00120345678" }', details={ }}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:176)
at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:205)
[...]
I have tried setting the write concern to "3" and even to "majority". I have also set the write concern timeout to 4/5 seconds, but the duplicate key error still appears from time to time.
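For reference, this is roughly how the write concern was configured. This is only a minimal sketch: "writeConcern.w" and "writeConcern.wTimeoutMS" are the standard MongoDB Spark Connector output options, but the variable name and the concrete values shown here are illustrative assumptions, not my exact job.

// Sketch of the write-concern options passed on the DataFrameWriter.
// The option keys are MongoDB Spark Connector output options; the
// values below are assumptions matching what the question describes.
val writerWithConcern: DataFrameWriter[Row] = spark.sql(sql).write
  .option("uri", theUri)
  .option("database", theDatabase)
  .option("collection", theCollection)
  .option("writeConcern.w", "majority")       // also tried "3"
  .option("writeConcern.wTimeoutMS", "5000")  // roughly the 4/5-second timeout
  .option("replaceDocument", "false")
  .mode("append")

MongoSpark.save(writerWithConcern)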
I would like to know how to configure the write so that no duplicate entries end up in the replica set.
Any suggestions? Thanks in advance!