I need to run several spiders. I'm using scrapyd with the default settings, and I managed to schedule my jobs through the scrapyd interface. Everything works fine, except that the jobs never end. Every time I check, I find 16 jobs running (4 jobs per CPU × 4 CPUs) while all the other jobs are pending, unless I shut scrapyd down.
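For reference, scheduling a job through the scrapyd interface boils down to a POST against its schedule.json endpoint; a minimal sketch of roughly what I do (the spider name is taken from the log below, adjust as needed):

import requests

# Schedule one spider run on the local scrapyd instance; scrapyd answers
# with a job id and the job sits in "pending" until a process slot frees up.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "spider1"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}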
I also checked the logs, and they say:
2013-09-22 12:20:55+0000 [spider1] INFO: Dumping Scrapy stats:
{
'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 244,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7886,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 9, 22, 12, 20, 55, 635611),
'log_count/DEBUG': 7,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 9, 22, 12, 20, 55, 270275)}
2013-09-22 12:20:55+0000 [spider1] INFO: Spider closed (finished)
How do you run hundreds of spiders with scrapyd?
Edit:
scrapy.cfg:
[settings]
default = myproject.scrapers.settings
[deploy]
url = http://localhost:6800/
project = myproject
version = GIT
[scrapyd]
eggs_dir = scrapy_dir/eggs
logs_dir = scrapy_dir/logs
items_dir = scrapy_dir/items
dbs_dir = scrapy_dir/dbs
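For what it's worth, the 16 jobs running at once match scrapyd's default process cap (max_proc_per_cpu = 4 on a 4-CPU machine). If the cap itself ever needs raising, it goes in this same [scrapyd] section; the values below are only illustrative, not part of my actual config:

# 0 means no absolute limit; the per-CPU cap below still applies
max_proc = 0
# default is 4; raise it to run more jobs per CPU
max_proc_per_cpu = 8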
Scrapy settings.py:
import os
from django.conf import settings
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
BOT_NAME = 'scrapers'
SPIDER_MODULES = ['myproject.scrapers.spiders']
DOWNLOADER_MIDDLEWARES = {
    'myproject.scrapers.middlewares.IgnoreDownloaderMiddleware': 50,
}
ITEM_PIPELINES = [
    'myproject.scrapers.pipelines.CheckPipeline',
    'myproject.scrapers.pipelines.CleanPipeline',
    'myproject.contrib.pipeline.images.ImagesPipeline',
    'myproject.scrapers.pipelines.SerializePipeline',
    'myproject.scrapers.pipelines.StatsCollectionPipeline',
]
DOWNLOAD_DELAY = 0.25
path_to_phatomjs = '/home/user/workspace/phantomjs-1.9.1-linux-x86_64/bin/phantomjs'
IMAGES_STORE = settings.MEDIA_ROOT + '/' + settings.IMAGES_STORE
IMAGES_THUMBS = {
    'small': (70, 70),
    'big': (270, 270),
}
Answer 0 (score: 0)
I tried to post this answer yesterday, as soon as I found the source of the problem, but something went wrong with my account.
The problem comes from the PhantomJS driver, which was preventing scrapyd from finishing the job.
At first I was quitting the driver in the __del__ method:
def __del__(self):
    self.driver.quit()
    ...
Now I've created a quit_driver function and hooked it to the spider_closed signal:
@classmethod
def from_crawler(cls, crawler):
    temp = cls(crawler.stats)
    crawler.signals.connect(temp.quit_driver, signal=signals.spider_closed)
    return temp
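quit_driver itself isn't shown above; a minimal sketch of what it looks like, assuming the PhantomJS driver instance is stored on self.driver:

def quit_driver(self, spider):
    # Shut down the PhantomJS process so the crawler (and the scrapyd job)
    # can actually terminate once the spider is closed.
    if getattr(self, "driver", None) is not None:
        self.driver.quit()
        self.driver = None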