Scrapy keeps starting new HTTP connections after the crawl finishes

Asked: 2018-01-10 07:21:59

Tags: python-3.x scrapy web-crawler scrapy-splash

After my spider has crawled all of its URLs, Scrapy does not stop. How can I make it stop once the crawl is finished?

The start URL is http://192.168.139.28/dvwa

After my spider finishes, the log keeps showing Starting new HTTP connection (1): 192.168.139.28, and I don't know how to make it stop on its own. Can you help me?

Here is the output:

 'retry/reason_count/504 Gateway Time-out': 2,
 'scheduler/dequeued': 82,
 'scheduler/dequeued/memory': 82,
 'scheduler/enqueued': 82,
 'scheduler/enqueued/memory': 82,
 'splash/execute/request_count': 40,
 'splash/execute/response_count/200': 38,
 'splash/execute/response_count/400': 1,
 'splash/execute/response_count/504': 3,
 'start_time': datetime.datetime(2018, 1, 10, 6, 36, 4, 298146)}
  2018-01-10 14:37:48 [scrapy.core.engine] INFO: Spider closed (finished)
  2018-01-10 14:38:41 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 192.168.139.28
  2018-01-10 14:38:41 [urllib3.connectionpool] DEBUG: http://192.168.139.28:80 "GET / HTTP/1.1" 200 3041
  2018-01-10 14:39:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 192.168.139.28
  2018-01-10 14:39:42 [urllib3.connectionpool] DEBUG: http://192.168.139.28:80 "GET / HTTP/1.1" 200 3041
  2018-01-10 14:40:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 192.168.139.28
  2018-01-10 14:40:42 [urllib3.connectionpool] DEBUG: http://192.168.139.28:80 "GET / HTTP/1.1" 200 3041
  2018-01-10 14:41:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 192.168.139.28
  2018-01-10 14:41:42 [urllib3.connectionpool] DEBUG: http://192.168.139.28:80 "GET / HTTP/1.1" 200 3041
  ...

I am using scrapy-splash with Scrapy. The Splash server was returning 504 errors, as described here, so I tried starting the Splash server with docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 3600, but it did not help: Scrapy still keeps logging Starting new HTTP connection (1): 192.168.139.28.
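
For reference, --max-timeout only raises the ceiling that Splash will accept; the render timeout each request actually uses is passed through SplashRequest's args. Below is a minimal sketch of the usual scrapy-splash wiring from its README (the class name, start URL handling, Splash address, and concrete timeout values here are illustrative assumptions, not the exact spider in question):

import scrapy
from scrapy_splash import SplashRequest

class Exp10itSpider(scrapy.Spider):
    name = 'exp10it'

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',  # assumed address of the Splash container
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        # ask Splash to render the page; 'timeout' must stay below --max-timeout
        yield SplashRequest('http://192.168.139.28/dvwa', self.parse,
                            args={'wait': 0.5, 'timeout': 300})

    def parse(self, response):
        self.logger.info('rendered %s (%d bytes)', response.url, len(response.body))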

The code I use to launch my spider is:

import os
from scrapy import cmdline

os.chdir("./crawler")  # switch into the Scrapy project directory
cmdline.execute('scrapy crawl exp10it'.split())
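
For comparison, scrapy.cmdline.execute is the entry point of the scrapy console command, while in-process launches are usually done with CrawlerProcess, whose start() call blocks until the crawl finishes and then stops the Twisted reactor. A minimal sketch under the same directory layout (an illustration, not a confirmed fix for the behaviour above):

import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

os.chdir("./crawler")                 # same project directory as above
process = CrawlerProcess(get_project_settings())
process.crawl('exp10it')              # spider name, as in the cmdline call
process.start()                       # blocks until the crawl is finished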

Later, when I tried running the spider from the command line with scrapy crawl exploit, the problem did not occur: Scrapy stopped normally after the crawl finished. But I don't know why the cmdline.execute version above does not stop.

0 Answers:

No answers.