Checking whether an item field already exists in a pymongo database with Scrapy

Date: 2018-03-30 13:59:22

Tags: python scrapy pymongo

I am trying to check whether item['email'] already exists in the database, and insert the item into the pymongo database only if it does not.

I don't want duplicate emails in the pymongo database.

However, I get this error:

ValueError: dictionary update sequence element #0 has length 17; 2 is required

This is what I have so far:

Pipelines.py

class myExporter(object):

    def __init__(self):
        i = 0
        while os.path.exists(SRCFILE % i):
            i += 1
        self.filename = SRCFILE % i
        with open(self.filename, 'w') as output:
            output = csv.writer(output)
            output.writerow(['Email', 'Website', 'Phone Number', 'Location'])
        connection = pymongo.MongoClient(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DATABASE']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.email = self.collection.find(dict(item['email']))
        for x in self.email:
            if x not in self.email:               
                self.collection.insert(dict(item))
                log.msg("Item wrote to MongoDB database {}, collection {}, at host {}, port {}".format(
                    settings['MONGODB_DATABASE'],
                    settings['MONGODB_COLLECTION'],
                    settings['MONGODB_HOST'],
                    settings['MONGODB_PORT']))
                with open(self.filename, 'a') as output:
                    output = csv.writer(output)
                    output.writerow([item['email'],
                                     item['website'],
                                     item['phonenumber'],
                                     item['location']])
                folder = os.path.join(DESTINATION_FOLDER, os.path.basename(self.filename))
                shutil.copy(self.filename, folder)
                return item

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

2 Answers:

Answer 0 (score: 0)

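Your error comes from the call self.collection.find(dict(item['email'])).
item is already a dictionary containing the key email, so there is no need to wrap anything in dict(). item['email'] on its own gives you the email value.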
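To see where the message comes from, here is a minimal reproduction. It assumes, purely for illustration, that item['email'] is a one-element list holding a 17-character address; the "length 17" in the error suggests a sequence whose first element is that long, but the question does not show the actual value.

```python
# Hypothetical value: a one-element list holding a 17-character address.
email_field = ["someone@email.com"]

try:
    # dict() expects an iterable of (key, value) pairs. Here element #0 is a
    # 17-character string, not a 2-item pair, so the conversion fails.
    dict(email_field)
except ValueError as exc:
    message = str(exc)
    print(message)

# A query filter has to be built explicitly as a mapping instead.
email = email_field[0] if isinstance(email_field, list) else email_field
query = {"email": email}
print(query)
```

If the spider stores the address as a plain string rather than a list, the same call still fails, only with a different length in the message.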

After that, check whether self.email actually contains a matching result before running the rest of the function.

Edit

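The new error from the comments says that the filter passed to a find query must be a dictionary. If you are looking up the email field in Mongo, use {'email': item['email']}.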
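Put together, the check-then-insert logic could look roughly like the sketch below. find_one and insert_one are real PyMongo Collection methods; insert_if_new and InMemoryCollection are hypothetical names, and the in-memory class exists only so the sketch runs without a MongoDB server.

```python
class InMemoryCollection:
    """Hypothetical stand-in exposing the two PyMongo methods used below."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first document whose fields all match the query, else None.
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(dict(doc))


def insert_if_new(collection, item):
    """Insert item unless a document with the same email already exists.

    Returns True when the item was inserted, False when it was a duplicate.
    """
    if collection.find_one({"email": item["email"]}) is not None:
        return False
    collection.insert_one(dict(item))
    return True


collection = InMemoryCollection()
first = insert_if_new(collection, {"email": "a@example.com", "website": "example.com"})
second = insert_if_new(collection, {"email": "a@example.com", "website": "other.example"})
print(first, second, len(collection.docs))
```

In process_item you would pass self.collection instead of the stand-in. Note that the original loop over the cursor never runs when nothing matches, so nothing would be inserted; testing the find_one result against None avoids that. For duplicates, a Scrapy pipeline can raise scrapy.exceptions.DropItem instead of returning the item. The check and the insert are two separate operations, so this is not atomic if several processes write at once.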

Answer 1 (score: 0)

This will not save emails that have already been added to the database.

Full code:

dropDups = True
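dropDups was an option for building a unique index in older MongoDB versions and was removed in MongoDB 3.0. The following is a sketch of the same idea with a plain unique index, not this answer's original code: create_index with unique=True makes the server reject a second document with the same email, and the pipeline catches DuplicateKeyError. The function names are hypothetical, and the fallback exception class is only there so the sketch can be read and run without PyMongo installed.

```python
try:
    from pymongo.errors import DuplicateKeyError
except ImportError:
    # Fallback so the sketch runs without PyMongo; not part of the real API.
    class DuplicateKeyError(Exception):
        pass


def ensure_unique_email_index(collection):
    """Ask MongoDB to reject documents whose email is already stored."""
    return collection.create_index("email", unique=True)


def insert_unless_duplicate(collection, item):
    """Try to insert item; return False if the unique index rejected it."""
    try:
        collection.insert_one(dict(item))
    except DuplicateKeyError:
        return False
    return True
```

With a real collection, call ensure_unique_email_index(self.collection) once when the pipeline starts, then insert_unless_duplicate(self.collection, item) inside process_item. Creating the index fails if the collection already contains duplicate emails, so those have to be cleaned up first.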