Scrapy-Splash does not render this page

Date: 2018-07-20 11:14:56

Tags: python html web-scraping scrapy scrapy-splash

Can anyone help me understand why Splash does not render this page, so that I can scrape it?

URL: https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage

Here is the spider I wrote:

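import scrapy
from scrapy_splash import SplashRequest


class planningApplications(scrapy.Spider):
    name = 'planning-application'

    def start_requests(self):
        # Send the request through Splash so the page's JavaScript gets rendered.
        yield SplashRequest(
            url='https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage',
            callback=self.parse
        )

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        # response.text replaces the deprecated response.body_as_unicode().
        self.log(response.text)
        item = {
            # Outer HTML of the first link in the results list, or None if absent.
            'test': response.xpath('//*[@id="searchresults"]/li[1]/a').extract_first()
        }
        yield item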

These are the Splash-related settings I have in settings.py:

SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I also tried the command below in scrapy shell and printed the response text, but it does not contain the HTML that holds the planning applications.

 scrapy shell 'http://localhost:8050/render.html?url=https://www6.hertsmere.gov.uk/online-applications/pagedSearchResults.do?action=page&searchCriteria.page=2'
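A side note on the command above: the target URL carries its own `&`, so when it is passed unencoded, everything after that `&` is read as a separate parameter of the Splash endpoint rather than as part of `url`. This may or may not be related to the missing content. A minimal sketch of building the endpoint URL with the target percent-encoded follows; `splash_render_url` is a hypothetical helper name, and the base address is assumed to match `SPLASH_URL` from the settings.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def splash_render_url(target_url, splash_base="http://localhost:8050", wait=None):
    """Build a Splash render.html URL with the target URL percent-encoded.

    splash_render_url is a hypothetical helper, not part of scrapy-splash.
    """
    params = {"url": target_url}
    if wait is not None:
        # 'wait' is the render.html argument for seconds to wait after page load.
        params["wait"] = wait
    return f"{splash_base.rstrip('/')}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    target = (
        "https://www6.hertsmere.gov.uk/online-applications/"
        "pagedSearchResults.do?action=page&searchCriteria.page=2"
    )
    endpoint = splash_render_url(target)
    print(endpoint)
    # Round-trip check: the whole target URL comes back as the single 'url' value.
    print(parse_qs(urlsplit(endpoint).query)["url"][0] == target)
```

The printed URL can then be passed to `scrapy shell` in quotes, in place of the hand-written one.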

If scrapy-splash does not work for this site, would you suggest using Scrapy with Selenium instead?

Any help would be greatly appreciated :)

1 Answer:

Answer 0: (score: 0)

I made a new spider with your configuration, and the problem turned out to be robots.txt:

DEBUG: Forbidden by robots.txt: <GET https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage>

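Scrapy downloads robots.txt first and only then starts crawling; requests that the file disallows are dropped. To change this, set ROBOTSTXT_OBEY to False.

Go to settings.py and change it: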

ROBOTSTXT_OBEY = False
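To illustrate how a robots.txt rule can lead to a "Forbidden by robots.txt" message, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `HYPOTHETICAL_ROBOTS_TXT` are an assumption made up for illustration, not the site's actual file, and `is_allowed` is a hypothetical helper; Scrapy's own robots.txt handling may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only; the real robots.txt may differ.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /online-applications/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    blocked = (
        "https://www6.hertsmere.gov.uk/online-applications/"
        "weeklyListResults.do?action=firstPage"
    )
    # Under the assumed rules, the path falls under the Disallow prefix.
    print(is_allowed(HYPOTHETICAL_ROBOTS_TXT, blocked))
```

With ROBOTSTXT_OBEY = True, a request that fails a check like this is filtered out before it is downloaded, so the parse callback never runs for it.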

Some of the output I received:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" class="js"><head>

<!-- #BeginEditable "doctitle" -->
<title>
    Error
</title>