I am working with a Spark 2 application that writes a DataFrame to a three-node MongoDB replica set through com.mongodb.spark.MongoSpark, using roughly the following command:
// The real command is similar to this one, depending on the options
// set on the DataFrame and the DataFrameWriter for the MongoDB configuration,
// such as the writeConcern
import org.apache.spark.sql.{DataFrameWriter, Row}
import com.mongodb.spark.MongoSpark

var df: DataFrameWriter[Row] = spark.sql(sql).write
  .option("uri", theUri)
  .option("database", theDatabase)
  .option("collection", theCollection)
  .option("replaceDocument", "false")
  .mode("append")
[...]
MongoSpark.save(df)
The thing is, even though I am sure the source data coming from the Hive table has a unique primary key, I get duplicate key errors when the Spark application runs:
2019-01-14 13:01:08 ERROR: Job aborted due to stage failure: Task 51 in stage 19.0 failed 8 times,
most recent failure: Lost task 51.7 in stage 19.0 (TID 762, mymachine, executor 21):
com.mongodb.MongoBulkWriteException: Bulk write operation error on server myserver.
Write errors: [BulkWriteError{index=0, code=11000,
message='E11000 duplicate key error collection:
ddbb.tmp_TABLE_190114125615 index: idx_unique dup key: { : "00120345678" }', details={ }}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:176)
at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:205)
[...]
I have tried setting the write concern to "3" and even to "majority". I have also set the write concern timeout to 4/5 seconds, but the duplicate key error still appears from time to time.
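For reference, this is roughly how the write concern was configured. This is only a minimal sketch: "writeConcern.w" and "writeConcern.wTimeoutMS" are the standard MongoDB Spark Connector output options, but the variable name and the concrete values shown here are illustrative assumptions, not my exact job.

// Sketch of the write-concern options passed on the DataFrameWriter.
// The option keys are MongoDB Spark Connector output options; the
// values below are assumptions matching what the question describes.
val writerWithConcern: DataFrameWriter[Row] = spark.sql(sql).write
  .option("uri", theUri)
  .option("database", theDatabase)
  .option("collection", theCollection)
  .option("writeConcern.w", "majority")       // also tried "3"
  .option("writeConcern.wTimeoutMS", "5000")  // roughly the 4/5-second timeout
  .option("replaceDocument", "false")
  .mode("append")

MongoSpark.save(writerWithConcern)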
I would like to know how to configure the write so that no duplicate entries end up in the replica set.
Any suggestions? Thanks in advance!