Question

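I am building a MongoDB database and want to avoid duplicate entries. At the moment I do it like this (inserting a document only after checking whether the entry already exists):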

from pymongo import Connection 
import pandas as pd
from time import strftime
from collections import OrderedDict

connection = Connection()
db = connection.mydb 
collection = db.mycollection

data = pd.read_csv("data/myfile.csv", parse_dates=[2,5])

for i in range(len(data)):
    if(collection.find({ "id":     data.ix[i, 0],                      \
                         "date1":  data.ix[i, 2].strftime("%Y-%m-%d"), \
                         "date2":  data.ix[i, 5].strftime("%Y-%m-%d"), \
                         "number": int(data.ix[i, 6]),                 \
                         "type":   data.ix[i, 7]}).count() == 0):
        collection.insert(here goes what I'd like to insert)

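This works, but it already has serious performance problems (with only about 100 MB of data), because every find() call seems to slow things down significantly.

Is there a way to speed this up? Or am I doing something fundamentally wrong? I need to avoid duplicates on only a certain set of fields, not on all fields (that is, there is also a "number2" field, which may differ, but if all the other fields match I still want to treat the entry as a duplicate).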

Answer 1

You can build a unique index on the fields you search on (mongo shell syntax):

db.mycollection.ensureIndex({id: 1, date1: 1, date2: 1, number: 1, type: 1}, {unique: true});

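When inserting a duplicate, catch the constraint-violation exception (and ignore it, if appropriate).

This will usually improve performance, because the duplicate check is done through an index lookup.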
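
From Python, the same idea might look like the sketch below. It assumes pymongo 3 or later (create_index, insert_one, and pymongo.errors.DuplicateKeyError); the helper names ensure_unique_index and insert_if_new and the KEY_FIELDS list are made up for illustration and do not come from the question.

```python
# Fields that together define a duplicate ("number2" is deliberately left out).
KEY_FIELDS = ["id", "date1", "date2", "number", "type"]


def ensure_unique_index(collection):
    """Create a compound unique index over KEY_FIELDS (1 means ascending)."""
    return collection.create_index([(field, 1) for field in KEY_FIELDS], unique=True)


def insert_if_new(collection, doc):
    """Insert doc; return False if the unique index rejected it as a duplicate."""
    # Imported inside the function so the sketch can be loaded without pymongo present.
    from pymongo.errors import DuplicateKeyError

    try:
        collection.insert_one(doc)
        return True
    except DuplicateKeyError:
        # The server refused the insert because the key fields already exist.
        return False
```

Create the index once with ensure_unique_index(collection), then call insert_if_new(collection, doc) for each row; no find() is needed before the insert.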

Answer 2

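Checking before inserting is not a good way to prevent duplicates. To prevent duplicate keys, use a primary key. See how to set a primary key in mongodb.

If that does not work for you, at least add a mongo index.

The best way to solve this (in my opinion) is to generate a key from all the relevant fields and then do one of two things:

Check for that key; if it is indexed, the lookup will be much faster
Make this key the primary key, so that inserting a duplicate fails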
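
One possible way to generate such a key is to hash the relevant fields, as in the sketch below. The function names dedup_key and with_key_as_id are hypothetical, and the choice of SHA-256 over a JSON encoding is an assumption rather than something the answer prescribes; it also assumes the field values are plain strings and integers.

```python
import hashlib
import json

# Fields that together define a duplicate ("number2" is deliberately left out).
KEY_FIELDS = ("id", "date1", "date2", "number", "type")


def dedup_key(doc, fields=KEY_FIELDS):
    """Return a stable hex digest built only from the fields that define a duplicate."""
    # A JSON list keeps field boundaries unambiguous ("ab" + "c" differs from "a" + "bc").
    payload = json.dumps([doc[f] for f in fields], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def with_key_as_id(doc, fields=KEY_FIELDS):
    """Return a copy of doc whose "_id" is the generated key."""
    keyed = dict(doc)
    keyed["_id"] = dedup_key(doc, fields)
    return keyed
```

Because MongoDB enforces uniqueness on "_id", inserting the result of with_key_as_id(doc) a second time should fail, which corresponds to the second option above.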

Answer 3

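You can perform an update() operation with the upsert flag; see Update Operations with the Upsert Flag.

Also, MongoDB already has a built-in ID field called "_id", so you can use it if it suits your needs. Here is what that looks like: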

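collection.update(
    # query: the fields that identify an existing document
    { "_id": data.ix[i, 0],
      "date1": data.ix[i, 2].strftime("%Y-%m-%d")
    },
    # document to store
    { "_id": data.ix[i, 0],
      "date1": data.ix[i, 2].strftime("%Y-%m-%d")
    },
    True  # upsert: insert the document when nothing matches the query
    )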
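
With a newer driver, the upsert approach could be written roughly as follows. This is a sketch assuming pymongo 3 or later (update_one and the upserted_id attribute of its result); the helper name upsert_if_absent and the use of $setOnInsert are illustrative choices, not part of the answer above.

```python
# Fields that together define a duplicate ("number2" is deliberately left out).
KEY_FIELDS = ("id", "date1", "date2", "number", "type")


def upsert_if_absent(collection, doc, key_fields=KEY_FIELDS):
    """Insert doc unless a document with the same key fields exists.

    Returns True when a new document was created, False when one already matched.
    """
    # Match only on the fields that define a duplicate.
    query = {field: doc[field] for field in key_fields}
    # $setOnInsert applies only when the upsert creates a document,
    # so an existing document is left unchanged.
    result = collection.update_one(query, {"$setOnInsert": doc}, upsert=True)
    return result.upserted_id is not None
```

This does the check and the insert in a single round trip per row. Without a unique index on the key fields, concurrent upserts can still produce duplicates, so the two techniques are probably best combined.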
