Question

Can anyone help me understand why Splash is not rendering this page, so that I can scrape it?

URL: https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage

This is the spider I have written:

import scrapy
from scrapy_splash import SplashRequest


class planningApplications(scrapy.Spider):
    name = 'planning-application'

    def start_requests(self):
        # Send the request through Splash so the page's JavaScript is rendered.
        yield SplashRequest(
            url='https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage',
            callback=self.parse
        )

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        # response.text replaces the deprecated response.body_as_unicode().
        self.log(response.text)
        item = {
            'test': response.xpath('//*[@id="searchresults"]/li[1]/a').extract_first()
        }
        yield item

These are the Splash-related settings I have in settings.py:

SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I also tried scrapy shell with the line below and printed the response text, but it contains none of the HTML that holds the planning applications.

 scrapy shell 'http://localhost:8050/render.html?url=https://www6.hertsmere.gov.uk/online-applications/pagedSearchResults.do?action=page&searchCriteria.page=2'
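One thing worth checking in the command above: the target URL is passed to render.html without percent-encoding, so the `&searchCriteria.page=2` part is read as a separate argument to Splash rather than as part of the target URL. A minimal sketch of building the encoded URL with the standard library follows; the helper name `splash_render_url` is hypothetical, and the Splash address is taken from the SPLASH_URL setting above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Splash endpoint, assuming the instance from SPLASH_URL above.
SPLASH_RENDER = "http://localhost:8050/render.html"

TARGET = ("https://www6.hertsmere.gov.uk/online-applications/"
          "pagedSearchResults.do?action=page&searchCriteria.page=2")


def splash_render_url(target_url):
    """Hypothetical helper: percent-encode the target URL as Splash's `url` argument."""
    return SPLASH_RENDER + "?" + urlencode({"url": target_url})


def url_argument(request_url):
    """Return the `url` argument as a server would parse it from the query string."""
    return parse_qs(urlsplit(request_url).query)["url"][0]


encoded = splash_render_url(TARGET)
unencoded = SPLASH_RENDER + "?url=" + TARGET

print(url_argument(encoded))    # the full target URL, including the page number
print(url_argument(unencoded))  # the target URL cut off at the first '&'
```

The encoded form can then be passed to scrapy shell in place of the hand-written URL.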

If the scrapy-splash approach does not work on this website, would you suggest using Scrapy with Selenium instead?

Any help would be greatly appreciated :)

Answer 1

I created a new spider with your configuration, and the problem is robots.txt.

DEBUG: Forbidden by robots.txt: <GET https://www6.hertsmere.gov.uk/online-applications/weeklyListResults.do?action=firstPage>

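Scrapy downloads robots.txt first and only then starts crawling; if the file disallows a URL, the request is dropped. To change this, set ROBOTSTXT_OBEY to False.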

Go to settings.py and change it:

ROBOTSTXT_OBEY = False
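To illustrate why the request was refused while ROBOTSTXT_OBEY was enabled, here is a minimal sketch using Python's standard `urllib.robotparser`. It is a stand-in for illustration, not Scrapy's own robots.txt parser, and the robots.txt content is hypothetical: the real file served by the site may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /online-applications/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked_url = ("https://www6.hertsmere.gov.uk/online-applications/"
               "weeklyListResults.do?action=firstPage")
open_url = "https://www6.hertsmere.gov.uk/"

print(is_allowed(ROBOTS_TXT, blocked_url))
print(is_allowed(ROBOTS_TXT, open_url))
```

With a rule like this in place, any crawler that obeys robots.txt skips the weekly list page before Splash is ever asked to render it.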

Some of the output I received:

http://www.w3.org/1999/xhtml" xml:lang="en" class="js"><head>

<!-- #BeginEditable "doctitle" -->
<title>
    Error
</title>

