I'm running Scrapyd and I'm seeing a strange problem when I launch 4 spiders at the same time.
2012-02-06 15:27:17+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,1,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,2,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,3,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:18+0100 [Launcher] Process started: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545
2012-02-06 15:27:19+0100 [Launcher] Process finished: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545
2012-02-06 15:27:23+0100 [Launcher] Process started: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546
2012-02-06 15:27:24+0100 [Launcher] Process finished: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546
2012-02-06 15:27:28+0100 [Launcher] Process started: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547
2012-02-06 15:27:29+0100 [Launcher] Process finished: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547
2012-02-06 15:27:33+0100 [Launcher] Process started: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549
2012-02-06 15:27:35+0100 [Launcher] Process finished: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549
I have already set the following in my Scrapyd configuration:
[scrapyd]
max_proc = 10
Why doesn't Scrapyd run the spiders concurrently, as fast as they are scheduled?
Answer 0 (score: 7)
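I solved this by editing line 30 of scrapyd/app.py.
Change timer = TimerService(5, poller.poll)
to timer = TimerService(0.1, poller.poll)
The first argument is the interval in seconds between calls to poller.poll, so this makes Scrapyd check the queue every 0.1 seconds instead of every 5 seconds.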
Edit: AliBZ's comment below about the configuration setting is a better way to change the polling frequency.
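The setting referred to is presumably Scrapyd's `poll_interval` option (in seconds, default 5.0), which goes in the same `[scrapyd]` section as `max_proc`. The sketch below is not Scrapyd code. It is a small model of the timing, under the assumption that the poller hands one queued job to the launcher per tick, which matches the 5-second gaps between the "Process started" lines in the question's log. The function name `start_times` is made up for illustration.

```python
def start_times(num_jobs, poll_interval):
    """Return the offsets (in seconds, relative to the first poll) at which
    each queued job would be launched.

    Assumption: exactly one job is started per poll tick, so job i starts
    i * poll_interval seconds after the first one.
    """
    return [round(i * poll_interval, 3) for i in range(num_jobs)]


# Default interval of 5 seconds: the four spiders start 5 seconds apart.
print(start_times(4, 5.0))  # [0.0, 5.0, 10.0, 15.0]

# Interval of 0.1 seconds: all four start within a fraction of a second.
print(start_times(4, 0.1))  # [0.0, 0.1, 0.2, 0.3]
```

With the shorter interval the spiders still start one after another, but the gap is small enough that they effectively run in parallel, up to `max_proc`.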
Answer 1 (score: 4)
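In my experience with scrapyd, it does not run a spider immediately when you schedule it. It usually waits a while until the current spider is up and running, and then starts the next spider process (scrapy crawl).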
So scrapyd starts the processes one by one until the max_proc count is reached.
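From your log I can see that each of your spiders runs for only about 1 second. I think that if your spiders ran for at least 30 seconds, you would see all of them running at the same time.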
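To make that reasoning concrete, here is a rough sketch that counts how many spiders overlap for a given runtime. It assumes, as the log suggests, that one spider is started every 5 seconds. The helper `max_concurrent` and its parameters are hypothetical and only model the timing; they are not part of Scrapyd.

```python
def max_concurrent(num_jobs, start_gap, runtime):
    """Peak number of jobs running at once when job i starts at
    i * start_gap seconds and each job runs for `runtime` seconds."""
    events = []
    for i in range(num_jobs):
        start = i * start_gap
        events.append((start, 1))             # job starts
        events.append((start + runtime, -1))  # job finishes
    # Tuples sort by time first; at equal times a finish (-1) sorts
    # before a start (+1), so back-to-back jobs do not count as overlapping.
    events.sort()

    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Spiders that finish in about 1 second never overlap with a 5-second gap.
print(max_concurrent(4, 5, 1))   # 1

# Spiders that run for 30 seconds are all alive at the same time.
print(max_concurrent(4, 5, 30))  # 4
```

This is why the log in the question shows each process finishing before the next one starts, even with max_proc = 10.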