Question

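I have about 1.7 million documents in MongoDB (10M+ in the future). Some of them represent duplicate entries that I do not want. The document structure looks like this: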

{
    _id: 14124412,
    nodes: [
        12345,
        54321
        ],
    name: "Some beauty"
}

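A document is a duplicate if it shares at least one node with another document of the same name. What is the fastest way to remove the duplicates?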
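To make the criterion concrete, here is a minimal in-memory sketch in Python (not a MongoDB query). It keeps the first document seen for each name and flags any later document with the same name that shares at least one node with a document already kept. The function name find_duplicate_ids and the sample data are hypothetical and only illustrate the rule.

```python
def find_duplicate_ids(docs):
    """Return the _id of every document that duplicates an earlier kept one.

    A document counts as a duplicate when an earlier kept document has the
    same name and shares at least one node with it.
    """
    kept_nodes_by_name = {}  # name -> set of nodes from documents kept so far
    duplicate_ids = []
    for doc in docs:
        seen = kept_nodes_by_name.setdefault(doc["name"], set())
        nodes = set(doc["nodes"])
        if seen & nodes:
            duplicate_ids.append(doc["_id"])
        else:
            seen |= nodes  # in-place update, so the dict entry grows too
    return duplicate_ids


docs = [
    {"_id": 1, "name": "Some beauty", "nodes": [12345, 54321]},
    {"_id": 2, "name": "Some beauty", "nodes": [54321, 99999]},  # shares 54321
    {"_id": 3, "name": "Some beauty", "nodes": [11111]},         # no shared node
    {"_id": 4, "name": "Other", "nodes": [12345]},               # different name
]
print(find_duplicate_ids(docs))  # [2]
```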

Answer 1

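The dropDups: true option is not available in 3.0.

I have a solution that uses the aggregation framework to collect the duplicates and then remove them in one go.

It may be somewhat slower than a system-level "index" change, but it fits the way you want to remove the duplicate documents.

a. Remove all documents in one go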

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // optional filter; here it skips documents with an empty name
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of hitting the memory limit
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )    
})

// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})

b. You can remove the documents one group at a time.

db.collectionName.aggregate([
  // Optional selection criteria; you can remove this "$match" stage if you want
  { $match: { 
    "source_references.key": { "$ne": '' }  // dotted field paths must be quoted
  }},
  { $group: { 
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of hitting the memory limit
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})

Answer 2

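Assuming you want to permanently delete documents that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option: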

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

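As the docs say, use this with caution, because it deletes data from your database. Back up your database first in case it does not do exactly what you expect.

UPDATE

This solution is only valid in MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).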

Answer 3

Create a dump of the collection with mongodump

Clear the collection

Add a unique index

Restore the collection with mongorestore

Answer 4

I found this solution, which works with MongoDB 3.4. I am assuming the field with duplicates is called fieldX:

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

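Being new to MongoDB, I spent a lot of time on other, lengthier solutions for finding and deleting duplicates. However, I think this solution is neat and easy to understand.

It works by first matching the documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups the documents by fieldX and keeps only the first document of each group, using $first on $$ROOT. Finally, $replaceRoot replaces the whole aggregated group with the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages. Although the documentation for $first mentions a sort stage before using $first, it worked for me without one; note that without a sort, which document of each group is kept is not guaranteed. "I could not post a link here, my reputation is below 10 :("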

You can save the results to a new collection by adding an $out stage...
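As a rough sketch of that idea in Python, the pipeline can be built as a plain list with a final $out stage. The helper name build_dedup_pipeline and the target collection name "collection_deduped" are hypothetical; with pymongo, the result would be passed to an aggregate call with allowDiskUse=True.

```python
def build_dedup_pipeline(field, out_collection):
    """Build a group/replaceRoot pipeline for `field`, ending in a $out stage."""
    return [
        # Only consider documents where the field is present and not null
        {"$match": {field: {"$nin": [None]}}},
        # Keep the first document of every group of equal values
        {"$group": {"_id": "$" + field, "doc": {"$first": "$$ROOT"}}},
        # Promote the kept document back to the top level
        {"$replaceRoot": {"newRoot": "$doc"}},
        # Write the result to out_collection ($out must be the last stage)
        {"$out": out_collection},
    ]


pipeline = build_dedup_pipeline("fieldX", "collection_deduped")
# Assumed usage with pymongo: db.collection.aggregate(pipeline, allowDiskUse=True)
print(pipeline[-1])  # {'$out': 'collection_deduped'}
```

Writing to a separate collection leaves the original data untouched, so the result can be checked before anything is swapped or dropped.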

Also, if you are only interested in a few fields, e.g. field1 and field2, rather than the whole document, use $first in the group stage without $replaceRoot:

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})

Answer 5

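My database had millions of duplicate records. @somnath's answer did not work for me, so I am writing up the solution that did, for anyone who needs to delete millions of duplicate records.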

/** Create a array to store all duplicate records ids*/
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain first element*/
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}

Answer 6

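I don't know whether this answers the main question, but it may be useful for others.

1. Query the duplicated row with the findOne() method and store it as an object.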

const User = db.User.findOne({_id:"duplicateid"});

2. Run the deleteMany() method to remove all rows with the id "duplicateid".

db.User.deleteMany({_id:"duplicateid"});

3. Insert the values stored in the User object.

db.User.insertOne(User);

Easy and fast!!!!

Answer 7

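The general idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/ to retrieve one arbitrary id from the duplicate records in the collection.
Then delete all records in the collection other than the id we retrieved with findOne.

If you want to do this in pymongo, you can do something like this.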

import logging

_logger = logging.getLogger(__name__)


def _run_query(collection):
    try:
        # One record per fieldX value that occurs more than once
        for record in aggregate_based_on_field(collection):
            if not record or record['_id'] is None:
                continue
            _logger.info("Working on Record %s", record)
            query = {'fieldX': record['_id']}

            try:
                # Pick one arbitrary document of the group to retain
                retain = collection.find_one(query, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])

                # Delete every other document with the same fieldX value
                collection.delete_many({**query, '_id': {'$ne': retain['_id']}})

            except Exception as ex:
                _logger.error(" Error when retaining the record :%s Exception: %s", record, str(ex))

    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([
        {'$group': {'_id': "$fieldX", 'count': {'$sum': 1}}},
        {'$match': {'count': {'$gt': 1}}},
    ])

From the shell:

Replace find_one with findOne
The same delete command should work.

Answer 8

The following method merges documents with the same name while keeping only the unique nodes, without duplicating them.

I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator lets the aggregation result persist [docs]. If you pass the name of the collection itself, it replaces the collection with the new data. If the name does not exist, it creates a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.
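A sketch of what such a pipeline could look like, reconstructed in Python from the description only rather than taken from the answer. The target collection name "merged" and the helper merge_by_name are hypothetical; the helper mimics in memory what the pipeline computes for documents with a non-empty nodes array.

```python
# Pipeline sketch: unwind nodes, regroup per name into a set, persist with $out.
# "merged" is a hypothetical target collection name.
pipeline = [
    {"$unwind": "$nodes"},
    {"$group": {"_id": "$name", "nodes": {"$addToSet": "$nodes"}}},
    {"$project": {"_id": 0, "name": "$_id", "nodes": 1}},
    {"$out": "merged"},
]


def merge_by_name(docs):
    """In-memory illustration of the pipeline: one document per name."""
    merged = {}
    for doc in docs:
        nodes = merged.setdefault(doc["name"], [])
        for node in doc["nodes"]:      # the $unwind step
            if node not in nodes:      # the $addToSet step
                nodes.append(node)
    return [{"name": name, "nodes": nodes} for name, nodes in merged.items()]


docs = [
    {"_id": 1, "name": "Some beauty", "nodes": [12345, 54321]},
    {"_id": 2, "name": "Some beauty", "nodes": [54321, 99999]},
]
print(merge_by_name(docs))
# [{'name': 'Some beauty', 'nodes': [12345, 54321, 99999]}]
```

Unlike the in-memory version, $addToSet does not guarantee the order of the resulting array, and $unwind drops documents whose nodes array is empty or missing.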

Answer 9

This should work using pymongo.

Put the fields that need to be unique for the collection into unique_field:

unique_field = {"field1":"$field1","field2":"$field2"}

cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}},
], allowDiskUse=True)

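Slice the dups array according to the number of duplicates (here I only had one extra duplicate per group):

items = list(cursor)
# items is empty when no duplicates were found
removeIds = items[0]['dups'] if items else []
DB.COL.delete_many({"uuid": {"$in": removeIds}})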
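The pipeline above only picks the element at index 1 of each dups array, so it removes a single extra copy per group. As a hedged sketch of how the slicing could be generalized when a group has several extra copies, the helper below (ids_to_remove, a hypothetical name) keeps the first id of each group and returns all the others. It works on plain dicts shaped like the output of the first two pipeline stages; the sample values are made up.

```python
def ids_to_remove(groups):
    """Keep the first id of every duplicate group and return all the others."""
    remove_ids = []
    for group in groups:
        remove_ids.extend(group["dups"][1:])
    return remove_ids


# Sample documents shaped like the output of the $group + $match stages
groups = [
    {"_id": {"field1": "a", "field2": "b"}, "dups": ["u1", "u2", "u3"], "count": 3},
    {"_id": {"field1": "c", "field2": "d"}, "dups": ["u4", "u5"], "count": 2},
]
print(ids_to_remove(groups))  # ['u2', 'u3', 'u5']
```

The returned list could then be used in a delete filter of the form {"uuid": {"$in": ...}}.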

Answer 10

First, you can find all the duplicates and then remove them from the database. Here we use the id column to check for and remove duplicates.

// Assumes aggregate() returns a promise, as with a Mongoose model
db.collection.aggregate([
    { "$match": { "id": { "$ne": null } } },
    // Collect the _id of every document that shares the same id value
    { "$group": { "_id": "$id", "ids": { "$push": "$_id" }, "count": { "$sum": 1 } } },
    { "$match": { "count": { "$gt": 1 } } },
    { "$sort": { "count": -1 } }
]).then(data => {
    // Keep the first _id of each group and delete the rest
    var dr = data.flatMap(d => d.ids.slice(1));
    console.log("Duplicate records:: ", dr);
    db.collection.deleteMany({ _id: { $in: dr } }).then(removedD => {
        console.log("Removed duplicate data:: ", removedD);
    })
})

Answer 11

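Tips to speed things up when only a small portion of your documents are duplicated:

You need an index on the field used to detect duplicates.
$group does not use the index, but it can take advantage of $sort, and $sort does use the index. So you should put a $sort step at the beginning.
Do an in-place delete_many() instead of $out to a new collection; this saves a lot of IO time and disk space.

If you use pymongo, you can do: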

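import itertools

import pymongo
from pymongo import IndexModel

# col is assumed to be an existing pymongo Collection
index_uuid = IndexModel(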
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
    {"$sort": {"uuid":1}},
    {
        "$group": {
            "_id": "$uuid",
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1}
        }
    },
    {
        "$match": {"count": {"$gt": 1}}
    },
]
it_cursor = col.aggregate(
    pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})

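Performance

I tested it on a database containing 30M documents and 1TB in size.

Without the index/sort, it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
With the index/sort but using $out to write to a new collection: this is safer if your filesystem does not support snapshots, but it requires a lot of disk space and took more than 40 minutes to finish, even though we were using SSDs. It would be much slower on HDD RAID.
With the index/sort and an in-place delete_many, it takes around 5 minutes in total.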

Answer 12

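Here is a slightly more 'manual' way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then run a search with each of those keys, and delete if that search returns more than one result.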

  db.collection.distinct("key").forEach((num)=>{
    var i = 0;
    db.collection.find({key: num}).forEach((doc)=>{
      if (i)   db.collection.remove({_id: doc._id})  // keep the first match, delete each later one by _id
      i++
    })
  });

Fastest way to remove duplicate documents in MongoDB
