Saving Scrapy items to MongoDB

Asked: 2014-11-15 23:39:45

Tags: python mongodb scrapy

In the pipelines.py of my Scrapy project, I'm trying to save my scraped items to MongoDB. However, I'm not sure I'm doing it the right way, because after the scrape finishes, when I go into the mongo shell and use the find() method, nothing comes back. During the scrape, Scrapy's log does tell me that all the items were scraped, and when I use the save-to-JSON command instead, all of my items are scraped and saved to a JSON file successfully. Here is what my pipelines.py looks like:

import pymongo
from scrapy.conf import settings
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DATABASE']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg("Item wrote to MongoDB database {}, collection {}, at host {}, port {}".format(
            settings['MONGODB_DATABASE'],
            settings['MONGODB_COLLECTION'],
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT']))
        return item
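
For reference, here is a minimal sketch of the same pipeline written against the newer PyMongo API: MongoClient replaced the Connection class, and insert_one replaced Collection.insert in PyMongo 3.x. The settings import and setting names are assumed to match the ones above; this is a sketch, not a drop-in fix for the question.

import pymongo
from scrapy.conf import settings  # same import as the original pipeline


class MongoDBPipeline(object):
    def __init__(self):
        # MongoClient is the replacement for the deprecated Connection class
        client = pymongo.MongoClient(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        db = client[settings['MONGODB_DATABASE']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # insert_one is the PyMongo 3.x replacement for Collection.insert
        self.collection.insert_one(dict(item))
        return item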

In my settings.py:

ITEM_PIPELINES = {'sportslab_scrape.pipelines.MongoDBPipeline':300}

MONGODB_HOST = 'localhost' # Change in prod
MONGODB_PORT = 27017 # Change in prod
MONGODB_DATABASE = "training" # Change in prod
MONGODB_COLLECTION = "sportslab"
MONGODB_USERNAME = "" # Change in prod
MONGODB_PASSWORD = "" # Change in prod
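
Note that the pipeline above never reads MONGODB_USERNAME or MONGODB_PASSWORD, so they have no effect. If authentication were enabled in prod, wiring them in could look roughly like the sketch below, assuming the PyMongo 2.x series (where db.authenticate() is available) that was current at the time:

import pymongo
from scrapy.conf import settings

connection = pymongo.Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
db = connection[settings['MONGODB_DATABASE']]
# Authenticate only when credentials are actually configured
if settings['MONGODB_USERNAME'] and settings['MONGODB_PASSWORD']:
    db.authenticate(settings['MONGODB_USERNAME'], settings['MONGODB_PASSWORD'])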

In my Scrapy crawl log:

2014-11-15 15:28:00-0800 [scrapy] INFO: Item wrote to MongoDB database training, collection sportslab, at host localhost, port 27017
2014-11-15 15:28:00-0800 [max] DEBUG: Scraped from <200 http://www.maxpreps.com/high-schools/st-john-bosco-braves-(bellflower,ca)/football/stats.htm>
        {'athlete_name': u'Mike Ray',
         'games_played': u'4',
         'jersey_number': u'9',
         'receiving_long': u'7',
         'receiving_num': u'1',
         'receiving_tdnum': '',
         'receiving_yards': u'7',
         'receiving_yards_per_game': u'1.8',
         'school': u'St. John Bosco Football',
         'yards_per_catch': u'7.0'}
2014-11-15 15:28:00-0800 [max] INFO: Closing spider (finished)
2014-11-15 15:28:00-0800 [max] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 283,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 35344,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 11, 15, 23, 28, 0, 613000),
         'item_scraped_count': 28,
         'log_count/DEBUG': 31,
         'log_count/INFO': 35,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 11, 15, 23, 28, 0, 83000)}
2014-11-15 15:28:00-0800 [max] INFO: Spider closed (finished)

Python shell mongo query. The 'test' document is just a placeholder I inserted when creating the sportslab collection:

>>> from pymongo import Connection
>>> con = Connection()
>>> db = con.training
>>> sportslab = db.sportslab
>>> print sportslab.find()
<pymongo.cursor.Cursor object at 0x0000000002ADB438>
>>> print sportslab.find_one()
{u'test': u'test', u'_id': ObjectId('5466131ca319d723f08d2387')}
>>>
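
One thing worth pointing out in the session above: printing the result of find() only shows the Cursor object's repr, not the documents. To actually see what is in the collection, the cursor has to be iterated. A quick sketch continuing the same shell session (Python 2 print syntax, matching the session above; Collection.count() is the PyMongo 2.x/3.x way to get the document total):

# find() returns a lazy cursor; iterate it to see the documents
for doc in sportslab.find():
    print doc

# total number of documents in the collection
print sportslab.count()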

0 Answers:

No answers yet.