Checking whether an id exists in MongoDB using pymongo and scrapy

Time: 2015-11-11 17:44:02

Tags: mongodb python-2.7 scrapy scrapy-pipeline

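I have set up a spider with scrapy that sends data to a MongoDB database. I want to check whether the _id already exists; if it does, I can use $addToSet on specific keys (otherwise Mongo will reject the insert because the _id already exists).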

Here is my pipelines.py:

import pymongo

class MongoDBPipeline(object):

    collection_name = 'logfile'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def idExists(self, item, spider):
        #this next line is where I'm running into trouble
        if self.db[self.collection_name].find({'_id': dict(item['_id'])}).limit(1).size() > 0
            return True
        else:
            return False

    def process_item(self, item, spider):
        if idExists == False:
            self.db[self.collection_name].insert(dict(item))
            return item
        else:
            pass #write the line to add only to the array with $addtoset

My items.py looks like this:

import scrapy

class CallLog(scrapy.Item):
    _id = scrapy.Field()
    placed = scrapy.Field()
    answered = scrapy.Field()

My spider looks like this:

import scrapy
import time

from callStats.items import CallLog
from scrapy.selector import Selector
from selenium import webdriver


class LogSpider(scrapy.Spider):
    name = "logspider"
    start_urls = [
        "http://www.domain.com/log1.htm",
        "http://www.domain.com/log2.htm",
        "http://www.domain.com/log3.htm"
    ]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.browser = webdriver.PhantomJS()

    def __del__(self):
        self.browser.exit()

    def parse(self, response):
        item = CallLog()
        self.browser.get(response.url)
        time.sleep(3)  # Wait for JavaScript to load in Selenium

        if response.request.url == "http://www.domain.com/log1.htm":
            idname = "Kirk"
        elif response.request.url == "http://www.domain.com/log2.htm":
            idname = "Jim"
        elif response.request.url == "http://www.domain.com/log3.htm":
            idname = "Spock"

        # scrape dynamically generated HTML
        hxs = Selector(text=self.browser.page_source)
        item['_id'] = idname
        item['placed'] = hxs.xpath('myxpath1').extract()
        item['answered'] = hxs.xpath('myxpath2').extract()
        return item

I am getting a syntax error in pipelines.py on this line:

if self.db[self.collection_name].find({'_id':dict(item['_id'])}).limit(1).size() > 0 

Here is the traceback:

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 153, in crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.engine = self._create_engine()
  File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 83, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "C:\Python27\lib\site-packages\scrapy\core\engine.py", line 67, in __init__
    self.scraper = Scraper(crawler)
  File "C:\Python27\lib\site-packages\scrapy\core\scraper.py", line 70, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "C:\Python27\lib\site-packages\scrapy\middleware.py", line 56, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "C:\Python27\lib\site-packages\scrapy\middleware.py", line 32, in from_settings
    mwcls = load_object(clspath)
  File "C:\Python27\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "C:\Python27\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
exceptions.SyntaxError: invalid syntax (pipelines.py, line 36)
2015-11-11 11:12:27 [twisted] CRITICAL:

I'm pulling my hair out because I feel I'm really close to getting this to work. In robomongo, when I run this query:

db.getCollection('logfile').find({'_id': 'Jim'})

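it shows me Jim's document. I just can't for the life of me figure out what to put inside find() so that it picks up the _id of the page I'm currently crawling.

Any help is greatly appreciated.

1 Answer:

Answer 0: (score: 0)

Instead of trying to create a new method in the pipeline, I checked whether the '_id' key exists inside the process_item method, like this: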

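    def process_item(self, item, spider):
        collection = self.db[self.collection_name]
        # item['_id'] is a plain value such as 'Jim', so query on it directly
        if collection.find({'_id': dict(item)['_id']}).limit(1).count() > 0:
            pass  # _id already stored; the $addToSet update would go here
        else:
            collection.insert(dict(item))
        # Scrapy expects process_item to return the item (or raise DropItem)
        return item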
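
For reference, the line in the question failed for reasons unrelated to MongoDB itself: the `if` statement had no trailing colon (the SyntaxError in the traceback), and `dict(item['_id'])` tries to build a dict from a plain string. As far as I know, PyMongo cursors also have no `size()` method; that name comes from the mongo shell. Note too that `cursor.count()` and `insert()` were removed in PyMongo 4. Below is a minimal sketch of the same existence check as a standalone helper, assuming PyMongo 3 or later, where `find_one` returns `None` when nothing matches. The name `id_exists` is made up for illustration, and the collection is passed in rather than created here.

```python
def id_exists(collection, item):
    """Return True if a document with the item's _id is already stored."""
    # item['_id'] is already a plain value such as 'Jim'; no dict() wrapper
    # is needed around it.
    # The projection {'_id': 1} asks the server to return only the _id field.
    return collection.find_one({'_id': item['_id']}, {'_id': 1}) is not None
```

Inside the pipeline this could be called as `id_exists(self.db[self.collection_name], item)`.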

Hat tip to user3100115 for the syntax help.
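
The question also asked how to add to the existing arrays when the `_id` is already present, which the answer leaves as `pass`. One possible approach, not taken from the answer, is to skip the existence check and let MongoDB perform an upsert with `$addToSet`. The sketch below assumes PyMongo 3 or later (`update_one`), that every field other than `_id` holds a list (as `extract()` returns), and that the item has at least one such field. The name `upsert_call_log` is hypothetical.

```python
def upsert_call_log(collection, item):
    """Insert the item, or merge its list fields into the existing document."""
    doc = dict(item)
    doc_id = doc.pop('_id')
    # $each adds every element of the list individually; without it,
    # $addToSet would add the whole list as one nested element.
    update = {'$addToSet': dict((key, {'$each': list(values)})
                                for key, values in doc.items())}
    # upsert=True creates the document when no document matches the filter.
    return collection.update_one({'_id': doc_id}, update, upsert=True)
```

In `process_item` this would be called as `upsert_call_log(self.db[self.collection_name], item)`, followed by `return item`. Because the server decides between insert and update in a single operation, there is no gap between a check and a write in which another process could insert the same `_id`.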