In the pipelines.py of my Scrapy project I'm trying to save my scraped items to MongoDB. However, I'm not sure I'm doing it correctly, because after my scrape finishes, when I go into the mongo shell and use the find() method, nothing comes back. During the scrape, Scrapy's log does tell me that all of the items were scraped, and when I use the save-to-JSON command all of my items are successfully scraped and saved to a JSON file. Here is what my pipelines.py looks like:
import pymongo
from scrapy.conf import settings
from scrapy import log

class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DATABASE']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg("Item wrote to MongoDB database {}, collection {}, at host {}, port {}".format(
            settings['MONGODB_DATABASE'],
            settings['MONGODB_COLLECTION'],
            settings['MONGODB_HOST'],
            settings['MONGODB_PORT']))
        return item
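Since the inserts happen inside a running crawl, it can be hard to tell whether process_item itself is at fault. Below is a minimal sketch of the same insert logic with the collection injected, using a hypothetical FakeCollection stand-in (not part of pymongo) so it runs without a live mongod and lets you confirm that dict(item) produces the document you expect:

```python
# Sketch: exercising the pipeline's insert logic without a running MongoDB.
# FakeCollection is a hypothetical stand-in, not a pymongo class.

class FakeCollection(object):
    """Records inserted documents instead of talking to mongod."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)
        return len(self.docs)  # pretend object id

class MongoDBPipeline(object):
    def __init__(self, collection):
        # The collection is injected so a fake can be passed in tests.
        self.collection = collection

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item

if __name__ == "__main__":
    fake = FakeCollection()
    pipeline = MongoDBPipeline(fake)
    item = {'athlete_name': u'Mike Ray', 'games_played': u'4'}
    pipeline.process_item(item, spider=None)
    print(fake.docs[0])
```

If the fake collection ends up holding the documents you expect, the problem is more likely in the connection or settings than in process_item.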
In my settings.py:

ITEM_PIPELINES = {'sportslab_scrape.pipelines.MongoDBPipeline': 300}
MONGODB_HOST = 'localhost' # Change in prod
MONGODB_PORT = 27017 # Change in prod
MONGODB_DATABASE = "training" # Change in prod
MONGODB_COLLECTION = "sportslab"
MONGODB_USERNAME = "" # Change in prod
MONGODB_PASSWORD = "" # Change in prod
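One thing worth double-checking is exactly where these settings point the pipeline. A small sketch that assembles a MongoDB URI from the same keys (a plain dict here stands in for Scrapy's settings object; mongo_uri is a hypothetical helper, not part of Scrapy or pymongo) makes the target easy to eyeball:

```python
# Sketch: build a connection URI from the settings above, to see
# where the pipeline would actually connect. mongo_uri is hypothetical.

def mongo_uri(settings):
    """Return a mongodb:// URI, including credentials only when set."""
    auth = ''
    if settings.get('MONGODB_USERNAME'):
        auth = '{}:{}@'.format(settings['MONGODB_USERNAME'],
                               settings['MONGODB_PASSWORD'])
    return 'mongodb://{}{}:{}/{}'.format(auth,
                                         settings['MONGODB_HOST'],
                                         settings['MONGODB_PORT'],
                                         settings['MONGODB_DATABASE'])

settings = {
    'MONGODB_HOST': 'localhost',
    'MONGODB_PORT': 27017,
    'MONGODB_DATABASE': 'training',
    'MONGODB_USERNAME': '',
    'MONGODB_PASSWORD': '',
}
print(mongo_uri(settings))  # mongodb://localhost:27017/training
```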
In my Scrapy crawl log:
2014-11-15 15:28:00-0800 [scrapy] INFO: Item wrote to MongoDB database training, collection sportslab, at host localhost, port 27017
2014-11-15 15:28:00-0800 [max] DEBUG: Scraped from <200 http://www.maxpreps.com/high-schools/st-john-bosco-braves-(bellflower,ca)/football/stats.htm>
{'athlete_name': u'Mike Ray',
'games_played': u'4',
'jersey_number': u'9',
'receiving_long': u'7',
'receiving_num': u'1',
'receiving_tdnum': '',
'receiving_yards': u'7',
'receiving_yards_per_game': u'1.8',
'school': u'St. John Bosco Football',
'yards_per_catch': u'7.0'}
2014-11-15 15:28:00-0800 [max] INFO: Closing spider (finished)
2014-11-15 15:28:00-0800 [max] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 283,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 35344,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 11, 15, 23, 28, 0, 613000),
'item_scraped_count': 28,
'log_count/DEBUG': 31,
'log_count/INFO': 35,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 11, 15, 23, 28, 0, 83000)}
2014-11-15 15:28:00-0800 [max] INFO: Spider closed (finished)
Python shell mongo query. 'test' is just a placeholder document from when I created the sportslab collection:
>>> from pymongo import Connection
>>> con = Connection()
>>> db = con.training
>>> sportslab = db.sportslab
>>> print sportslab.find()
<pymongo.cursor.Cursor object at 0x0000000002ADB438>
>>> print sportslab.find_one()
{u'test': u'test', u'_id': ObjectId('5466131ca319d723f08d2387')}
>>>
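Note that printing find() only shows the Cursor object itself, because pymongo cursors are lazy: no documents appear until the cursor is iterated. A small sketch of that difference, with a hypothetical generator standing in for the cursor so it runs without a server:

```python
# Sketch: pymongo's find() returns a lazy cursor, so printing it shows
# only the object, not the documents. A generator behaves the same way;
# fake_find is a hypothetical stand-in for Collection.find().

def fake_find(docs):
    """Yields documents lazily, like a pymongo Cursor."""
    for doc in docs:
        yield doc

docs = [{'_id': 1, 'athlete_name': 'Mike Ray'}]

print(fake_find(docs))        # prints the generator object, not the data
print(list(fake_find(docs)))  # iterating materialises the documents
```

With a real collection the equivalent checks are `for doc in sportslab.find(): print doc` or `list(sportslab.find())`, either of which would show the stored documents if the inserts had actually reached this database.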