Question

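In my Scrapy project I use PyMongo to store the scraped data in MongoDB. Crawling the site page by page produces duplicate records, and I want to drop records that share the same name at the moment they are inserted into the database. Please suggest the best solution. Here is my code in "pipelines.py". Please show me how to remove duplicates in the "process_item" method. I found a few queries on the Internet for removing duplicates from the database itself, but I need a Python solution.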

from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item

Answer 1

It depends somewhat on what is in the item, but I would use update with upsert:

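def process_item(self, item, spider):
    # pseudo example: 'website' and 'some_params' stand in for your own item fields
    website = item.get('website')
    if website:
        # update_one takes a filter document and an update document that is
        # built from update operators such as $set, for example:
        # self.collection.update_one(
        #     {"website": "abc"},
        #     {"$set": {"div foo": "sometext"}},
        #     upsert=True
        #     )
        _filter = {'website': website}
        update = {'$set': {'some_params': item.get('some_params')}}
        self.collection.update_one(_filter, update, upsert=True)
    return item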

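You can also use a filter. Basically, you don't even have to delete duplicates. Applied properly, an upsert works like an if-else condition: if the object doesn't exist, create it; otherwise, update it with the given attributes on the given key, just like assigning to a key in a dictionary. In the worst case it updates the document with the same values. That makes it faster than inserting, querying, and then deleting the duplicates you find.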

docs
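To make the upsert idea concrete for the pipeline in the question, here is a minimal sketch rather than a definitive implementation. It assumes the duplicate key is an item field called `name` (the question only says "same name", so the field name is a guess), and the class name `MongoUpsertPipeline` and the `key_field` parameter are made up for illustration. The setting names are the ones the question already uses.

```python
class MongoUpsertPipeline:
    """Sketch of a Scrapy pipeline that upserts items keyed on their name."""

    def __init__(self, server, port, db_name, collection_name, key_field="name"):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        self.key_field = key_field  # assumed field that identifies a duplicate
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Reads the same setting names as the question's pipelines.py.
        s = crawler.settings
        return cls(
            s.get("MONGODB_SERVER"),
            s.getint("MONGODB_PORT"),
            s.get("MONGODB_DB"),
            s.get("MONGODB_COLLECTION"),
        )

    def open_spider(self, spider):
        # Imported here so the class can be tried out with a stand-in collection.
        from pymongo import MongoClient
        self.client = MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        key = doc.get(self.key_field)
        if key is None:
            return item  # nothing to match on, so skip the write
        # Match on the name; create the document if it is missing,
        # otherwise overwrite its fields with the freshly scraped values.
        self.collection.update_one(
            {self.key_field: key}, {"$set": doc}, upsert=True
        )
        return item
```

With this in place, scraping the same name on two different pages leaves a single document holding the values from the later page. If you would rather keep the first version untouched, `$setOnInsert` in place of `$set` is the operator to look at.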

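MongoDB has no literal if-else, and @tanaydin's suggestion of automatically dropping dupes can also be used from Python. Depending on what you actually need, it may be a better fit than my suggestion.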

If you really do want to delete documents that match specific criteria, PyMongo has delete_one and delete_many.

docs
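If duplicates are already in the collection, a one-off cleanup along the following lines may help. This is only a sketch under assumptions: the helper name `remove_duplicate_names` and the `key_field` default of `name` are hypothetical, and it keeps whichever `_id` happens to come first in each group, which is not a guaranteed order. It uses `aggregate` to find repeated names and `delete_many` to remove the extras.

```python
def remove_duplicate_names(collection, key_field="name"):
    """Delete all but one document for each repeated value of key_field.

    `collection` is expected to be a PyMongo Collection. Returns the number
    of documents removed.
    """
    pipeline = [
        # Collect the _id of every document that shares the same key value.
        {"$group": {
            "_id": "$" + key_field,
            "ids": {"$push": "$_id"},
            "count": {"$sum": 1},
        }},
        # Only groups with more than one document contain duplicates.
        {"$match": {"count": {"$gt": 1}}},
    ]
    removed = 0
    for group in collection.aggregate(pipeline):
        extra_ids = group["ids"][1:]  # keep one document, drop the rest
        result = collection.delete_many({"_id": {"$in": extra_ids}})
        removed += result.deleted_count
    return removed
```

Running it once, for example `remove_duplicate_names(db[settings['MONGODB_COLLECTION']])`, and then switching the pipeline to upserts should keep new duplicates from appearing.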

How to remove duplicates when inserting new records into MongoDB with PyMongo in a Scrapy project
