For some reason, every request I make while using Splash gets a 403. What am I doing wrong?
Following https://github.com/scrapy-plugins/scrapy-splash, I set up all the settings:
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Splash is started with Docker:
sudo docker run -p 8050:8050 scrapinghub/splash
The spider code:
import scrapy
from scrapy import Selector
from scrapy_splash import SplashRequest


class VestiaireSpider(scrapy.Spider):
    name = "vestiaire"
    base_url = "https://www.vestiairecollective.com"
    rotate_user_agent = True

    def start_requests(self):
        urls = ["https://www.vestiairecollective.com/men-clothing/jeans/"]
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse, meta={'args': {"wait": 0.5}})

    def parse(self, response):
        data = Selector(response)
        category_name = data.xpath('//h1[@class="campaign campaign-title clearfix"]/text()').extract_first().strip()
        self.log(category_name)
Then I run the spider:
scrapy crawl test
The request gets a 403 for the target URL:
2017-12-19 22:55:17 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawlers)
2017-12-19 22:55:17 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'CONCURRENT_REQUESTS': 10, 'NEWSPIDER_MODULE': 'crawlers.spiders', 'SPIDER_MODULES': ['crawlers.spiders'], 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'BOT_NAME': 'crawlers', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-19 22:55:17 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-19 22:55:17 [scrapy.core.engine] INFO: Spider opened
2017-12-19 22:55:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-19 22:55:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-19 22:55:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vestiairecollective.com/robots.txt> (referer: None)
2017-12-19 22:55:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://localhost:8050/robots.txt> (referer: None)
2017-12-19 22:55:23 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.vestiairecollective.com/men-clothing/jeans/ via http://localhost:8050/render.html> (referer: None)
2017-12-19 22:55:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.vestiairecollective.com/men-clothing/jeans/>: HTTP status code is not handled or not allowed
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-19 22:55:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1254,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2793,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/403': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 19, 20, 55, 23, 440598),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/403': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 53850112,
 'memusage/startup': 53850112,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/403': 1,
 'start_time': datetime.datetime(2017, 12, 19, 20, 55, 17, 372080)}
2017-12-19 22:55:23 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 1):
The problem is the User-Agent. Many websites require a browser-like one before they grant access. The easiest way to reach the site and avoid getting banned is to randomize the User-Agent with this library: https://github.com/cnu/scrapy-random-useragent
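As a minimal sketch of what that setup might look like (the setting names below follow that library's README as I recall it, so treat them as assumptions and verify against the project page), you disable Scrapy's stock UserAgentMiddleware and point the randomizing middleware at a plain-text list of User-Agent strings:

# settings.py -- hedged sketch of a scrapy-random-useragent setup
DOWNLOADER_MIDDLEWARES = {
    # keep the scrapy_splash middlewares from the question, plus:
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the fixed UA
    'random_useragent.RandomUserAgentMiddleware': 400,  # pick a random UA per request
}
# Hypothetical path: a text file with one User-Agent string per line.
USER_AGENT_LIST = "/path/to/useragents.txt"

To confirm the diagnosis first without adding a dependency, you can also hard-code a browser-like User-Agent on the request itself; scrapy-splash passes request headers on to Splash, which uses the User-Agent for its outgoing fetch. The UA string here is an arbitrary example, and args= is the documented way to pass Splash arguments such as wait:

yield SplashRequest(
    url=url,
    callback=self.parse,
    args={"wait": 0.5},
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/63.0.3239.84 Safari/537.36"},
)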