I built a (scrapy + frontera) example that crawls a single web page. After I run the command scrapy crawl myProject, I get this output:
E:\scrapyProject\mirchi>scrapy crawl dmoz
2015-08-17 22:12:54 [scrapy] INFO: Scrapy 1.0.3 started (bot: mirchi)
2015-08-17 22:12:54 [scrapy] INFO: Optional features available: ssl, http11
2015-08-17 22:12:54 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mirchi.spiders', 'SPIDER_MODULES': ['mirchi.spiders'], 'SCHEDULER': 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler', 'BOT_NAME': 'mirchi'}
2015-08-17 22:12:58 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-17 22:13:04 [py.warnings] WARNING: C:\Python27\lib\site-packages\frontera\contrib\scrapy\schedulers\frontier.py:3: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2015-08-17 22:13:06 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats, SchedulerDownloaderMiddleware
2015-08-17 22:13:06 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware, SchedulerSpiderMiddleware
2015-08-17 22:13:06 [scrapy] INFO: Enabled item pipelines:
2015-08-17 22:13:06 [scrapy] INFO: Spider opened
2015-08-17 22:13:06 [py.warnings] WARNING: C:\Python27\lib\site-packages\frontera\contrib\scrapy\schedulers\frontier.py:123: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg('Starting frontier', log.INFO)
2015-08-17 22:13:06 [scrapy] INFO: Starting frontier
2015-08-17 22:13:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-17 22:13:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-17 22:13:06 [scrapy] INFO: Closing spider (finished)
2015-08-17 22:13:06 [py.warnings] WARNING: C:\Python27\lib\site-packages\frontera\contrib\scrapy\schedulers\frontier.py:128: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg('Finishing frontier (%s)' % reason, log.INFO)
2015-08-17 22:13:06 [scrapy] INFO: Finishing frontier (finished)
2015-08-17 22:13:06 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 8, 17, 16, 43, 6, 848000),
'frontera/iterations': 0,
'frontera/pending_requests_count': 0,
'frontera/seeds_count': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 9,
'log_count/WARNING': 3,
'start_time': datetime.datetime(2015, 8, 17, 16, 43, 6, 681000)}
2015-08-17 22:13:06 [scrapy] INFO: Spider closed (finished)
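If I run the same code with scrapy alone, it works fine. But once I combine the two, the output says 0 pages crawled. What am I doing wrong? Please mention in the comments if more information is needed.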
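To make the symptom concrete, the relevant counters can be read straight off the stats dump above. This is only a minimal sketch: the `stats` dict is copied from the log, and `summarize` is a hypothetical helper name, not part of Scrapy or Frontera.

```python
import datetime

# Stats copied verbatim from the "Dumping Scrapy stats" section of the log.
stats = {
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2015, 8, 17, 16, 43, 6, 848000),
    'frontera/iterations': 0,
    'frontera/pending_requests_count': 0,
    'frontera/seeds_count': 1,
    'log_count/DEBUG': 1,
    'log_count/INFO': 9,
    'log_count/WARNING': 3,
    'start_time': datetime.datetime(2015, 8, 17, 16, 43, 6, 681000),
}


def summarize(stats):
    """Hypothetical helper: pull out the counters that describe the symptom."""
    elapsed = stats['finish_time'] - stats['start_time']
    return {
        'seeds': stats.get('frontera/seeds_count', 0),
        'iterations': stats.get('frontera/iterations', 0),
        'pending': stats.get('frontera/pending_requests_count', 0),
        # The DownloaderStats middleware sets this key when a request is sent;
        # it is absent from the dump, so it falls back to 0 here.
        'requests_sent': stats.get('downloader/request_count', 0),
        'run_seconds': elapsed.total_seconds(),
    }


summary = summarize(stats)
print(summary)
```

So the frontier reports one seed, but it never ran an iteration and the downloader never sent a request before the spider closed, all within a fraction of a second.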
Edit - code for mirchi\mirchi\settings.py ->
BOT_NAME = 'mirchi'
SPIDER_MODULES = ['mirchi.spiders']
NEWSPIDER_MODULE = 'mirchi.spiders'
SPIDER_MIDDLEWARES = {
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
FRONTERA_SETTINGS = 'frontera.settings'
Code for mirchi\mirchi\frontera\settings.py ->
BACKEND = 'frontera.contrib.backends.memory.heapq.BFS'