Error when trying to scrape a JS page with Scrapy and Splash

Time: 2018-09-13 09:36:27

Tags: python lua scrapy scrapy-splash scrapy-shell

I keep running into this problem in the shell:

 2018-09-13 14:50:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
 2018-09-13 14:50:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6028
 2018-09-13 14:50:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
 2018-09-13 14:50:38 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
 2018-09-13 14:51:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
 2018-09-13 14:51:36 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
 2018-09-13 14:51:40 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 2 times): 504 Gateway Time-out
 2018-09-13 14:52:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
 2018-09-13 14:52:00 [scrapy.core.engine] DEBUG: Crawled (502) <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (referer: None)
 2018-09-13 14:52:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 http://quotes.toscrape.com/js/>: HTTP status code is not handled or not allowed

Here is my code:

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        # Send every start URL through Splash so the page's JavaScript gets rendered
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        # Each quote on the rendered page lives in a div.quote element
        for quote in response.css("div.quote"):
            scraped_info = {
                'authorname': quote.css('small.author::text').extract_first(),
                'quote': quote.css('span.text::text').extract_first(),
            }
            yield scraped_info

I have installed scrapy-splash and added the required settings to settings.py. My Splash server is also running at http://localhost:8050/
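For reference, the settings that the scrapy-splash README asks for look roughly like the sketch below. It assumes Splash is reachable at http://localhost:8050 (the address mentioned above); it is not necessarily the exact content of the settings.py used here.

```python
# settings.py (sketch, based on the scrapy-splash README)

# Assumption: Splash listens on the address given in the question.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

If SPLASH_URL points somewhere other than the running Splash instance, requests never reach the renderer, so it is worth comparing this value against the address shown in the log.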

Also, when I try to render any URL on the Splash server, I get another error:

HTTP Error 400 (Bad Request) Type: ScriptError -> LUA_ERROR Error happened while executing Lua script

Lua error: [string "function main(splash, args)..."]:2: network3
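One way to narrow this down is to call Splash's render.html endpoint directly, without Scrapy in between. The sketch below uses only the Python standard library; the helper names build_render_url and check_splash are hypothetical, and the Splash address and the wait/timeout values are assumptions. If Splash error names of the form network<N> carry a Qt network error code, as I understand they do, then network3 would mean "host not found", which would point at the Splash process being unable to resolve or reach the target site rather than at the spider.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash runs at the address given in the question.
SPLASH_URL = "http://localhost:8050"


def build_render_url(target_url, wait=0.5, timeout=30, splash_url=SPLASH_URL):
    """Build a render.html URL for target_url (hypothetical helper)."""
    query = urlencode({"url": target_url, "wait": wait, "timeout": timeout})
    return f"{splash_url}/render.html?{query}"


def check_splash(target_url):
    """Return (HTTP status or None, start of the response body or error text)."""
    try:
        with urlopen(build_render_url(target_url), timeout=60) as resp:
            return resp.status, resp.read(200).decode("utf-8", "replace")
    except HTTPError as exc:
        # Splash answered, but with an error status such as 400, 502 or 504.
        return exc.code, exc.read(200).decode("utf-8", "replace")
    except URLError as exc:
        # Splash itself could not be reached from this machine.
        return None, str(exc.reason)


if __name__ == "__main__":
    print(check_splash("http://quotes.toscrape.com/js/"))
```

A 200 status with HTML would suggest Splash can render the page and the problem sits in the Scrapy configuration; an error status or a network error name would suggest the problem sits in Splash or its network access.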

I am using:

  • Splash version: 3.2

  • Lua 5.2

0 answers:

No answers